--- categories: - "" - "" date: "2020-09-14T22:42:51-05:00" description: Regression and Prediction Model of AirBnB Prices in Copenhagen draft: false image: copenhagen.jpg keywords: "" slug: blog4 title: How much Does it Cost to Sleep in Copenhagen? ---

AirBnB prices in Copenhagen

The purpose of this report is to explore data on AirBnB listings in Copenhagen and understand the key drivers of price variability, specifically focused on the price for a 4-night stay for 2 guests.

Before going into detail into my analyses showing the code, I will quickly summarize the problem, approach and findings of the project:

The Problem

What is the best model to predict overall price of AirBnB accommodation in Copenhagen for 2 people wishing to spend 4 nights in the city?

Applied methodologies

The research is based on conducting Exploratory Data Analysis, followed by Geospatial Analysis to gain more context on the variables of interest and their spatial, as well as statistical, distributions, before building, evaluating and ultimately selecting an optimal regression model to predict the price of an AirBnB stay in the city.

My Approach and Findings

Exploratory Data Analysis

This chapter presents the very first steps I took to tidy the data and understand its structure.

  • The data needed wrangling, since quite a few variables were not of the appropriate data type. Furthermore, a significant number of variables were either incomplete or not readily interpretable. I have adjusted the levels of some variables, clustering and recategorising them; created new cleaned variables; ensured NA values were correctly coded as such; and imputed NA’s using my judgment, wherever necessary.

  • After having selected meaningful variables, summary statistics were created and then visualized in tables, density plots, bar plots and correlation matrices. In doing so, I found some correlations between review-related variables and listing type-related variables.

Mapping

This chapter describes geospatial analyses conducted by using the leaflet package. These mapped visualizations were created in order to glean a more holistic perspective not only of the statistical distribution of listings across certain variables and levels, but the spatial distribution of listings with certain characteristics (such as price and type). My maps include: Nightly Prices in Copenhagen, a Clustered map of AirBnB accommodation, and a comprehensive Map of nightly price, grouped by property type. I summarize some of my key geospatial findings below:

  • Nightly Prices in Copenhagen Map: listings are highly priced in central areas, especially on the inland waterways and Eastern coastline, such as the lakes and Amager

  • Clustered map of AirBnB accommodation: the vast majority of listings, as I would expect, are in the city center and in key areas around the center such as Norrebro, Osterbro, Vesterbro and Fredriksberg

  • Map by property type: the most common property type is Apartment; Apartment and Condominium prices decline with distance to the city center while House and Townhouse prices are not overtly correlated with centrality

Regression Analysis

This chapter focuses on describing the process behind building the optimal model, guiding the reader through 7 preliminary models before identifying, specifying, testing and comparing linear regression model performance. I conclude by identifying a model which I find to explain the greatest proportion of variance in price of AirBnB listings while upholding core linear regression assumptions and showing the least overfitting of all tested models.

My Best Model

In summary, my “best”, or optimal, model was generated via the following core steps:

  • The data was split into training and testing data with corresponding ratio of 3:1.
  • The best fitted model includes variables across 4 key categories: Facility & Location, Host & Reviews, Flexibility and Amenities and has an Adjusted R2 of 46.4%
  • The model showed the least overfitting of all models tested, and showed no concerning collinearity
  • The point estimate generated by the optimal model for the price of a 4-night stay in Copenhagen for 2 people was $1949 with a confidence interval of $1309 to $2597

Detailed Analyses and Code

Loading Data

As first step I load the required packages for all stages of my analysis.

Next, I load in the Copenhagen AirBnB data, which has been scraped from the AirBnB website using vroom, which is a package allowing for fast reading of large datasets such as ours. I also clean the variable names as I import the data, to allow us to push on with my EDA without extensive renaming.

Exploratory Data Analysis

I conduct a thorough Exploratory Data Analysis (EDA), using questions to guide my investigation. In this phase of my analyses, I not only attempt to gain an understanding of my data, but also manipulate or ‘wrangle’ it so that each variable behaves as it should, and there is no unnecessary data missingness or structural problems. This allows us to dive deep into my data and uncover hidden relationships without fear of unexpected errors and without being constrained by poor data coding, since my dataset was scraped from AirBnB’s public website.

My Goals During EDA

My primary focus during EDA is to develop an understanding of the data. I am guided by the following questions for my investigations:

Raw Values

I use glimpse to have a first look at the dataset, in order to evaluate all columns and the data types. Then, I skim the data to get a better picture of the dataset:

## Rows: 28,523
## Columns: 106
## $ id                                           <dbl> 6983, 26057, 26473, 29...
## $ listing_url                                  <chr> "https://www.airbnb.co...
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2....
## $ last_scraped                                 <date> 2020-06-28, 2020-06-2...
## $ name                                         <chr> "Copenhagen 'N Livin'"...
## $ summary                                      <chr> "Lovely apartment loca...
## $ space                                        <chr> "Beautiful and cosy ap...
## $ description                                  <chr> "Lovely apartment loca...
## $ experiences_offered                          <chr> "none", "none", "none"...
## $ neighborhood_overview                        <chr> "Nice bars and cozy ca...
## $ notes                                        <chr> NA, NA, NA, NA, "Pleas...
## $ transit                                      <chr> "Bus 66 runs to the ce...
## $ access                                       <chr> "Bedroom, living room,...
## $ interaction                                  <chr> "We are usually at wor...
## $ house_rules                                  <chr> "No smoking allowed! N...
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url                                  <chr> "https://a0.muscache.c...
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id                                      <dbl> 16774, 109777, 112210,...
## $ host_url                                     <chr> "https://www.airbnb.co...
## $ host_name                                    <chr> "Simon", "Kari", "Oliv...
## $ host_since                                   <date> 2009-05-12, 2010-04-1...
## $ host_location                                <chr> "Copenhagen, Capital R...
## $ host_about                                   <chr> "I'm currently working...
## $ host_response_time                           <chr> "N/A", "N/A", "within ...
## $ host_response_rate                           <chr> "N/A", "N/A", "100%", ...
## $ host_acceptance_rate                         <chr> "33%", "19%", "100%", ...
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url                           <chr> "https://a0.muscache.c...
## $ host_picture_url                             <chr> "https://a0.muscache.c...
## $ host_neighbourhood                           <chr> "Nørrebro", "Indre By"...
## $ host_listings_count                          <dbl> 1, 1, 4, 1, 1, 1, 3, 1...
## $ host_total_listings_count                    <dbl> 1, 1, 4, 1, 1, 1, 3, 1...
## $ host_verifications                           <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified                       <lgl> FALSE, FALSE, TRUE, FA...
## $ street                                       <chr> "Copenhagen, Hovedstad...
## $ neighbourhood                                <chr> "Nørrebro", "Indre By"...
## $ neighbourhood_cleansed                       <chr> "Nrrebro", "Indre By",...
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA...
## $ city                                         <chr> "Copenhagen", "Copenha...
## $ state                                        <chr> "Hovedstaden", "Hoveds...
## $ zipcode                                      <dbl> 2200, 2100, 1210, 1650...
## $ market                                       <chr> "Copenhagen", "Copenha...
## $ smart_location                               <chr> "Copenhagen, Denmark",...
## $ country_code                                 <chr> "DK", "DK", "DK", "DK"...
## $ country                                      <chr> "Denmark", "Denmark", ...
## $ latitude                                     <dbl> 55.7, 55.7, 55.7, 55.7...
## $ longitude                                    <dbl> 12.5, 12.6, 12.6, 12.6...
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, TRUE...
## $ property_type                                <chr> "Apartment", "House", ...
## $ room_type                                    <chr> "Private room", "Entir...
## $ accommodates                                 <dbl> 2, 6, 12, 2, 4, 3, 3, ...
## $ bathrooms                                    <dbl> 1.0, 1.5, 2.5, 1.0, 1....
## $ bedrooms                                     <dbl> 1, 4, 6, 1, 3, 1, 1, 2...
## $ beds                                         <dbl> 1, 4, 7, 1, 3, 3, 2, 2...
## $ bed_type                                     <chr> "Real Bed", "Real Bed"...
## $ amenities                                    <chr> "{TV,\"Cable TV\",Wifi...
## $ square_feet                                  <dbl> 97, NA, NA, NA, NA, 68...
## $ price                                        <chr> "$365.00", "$2,398.00"...
## $ weekly_price                                 <chr> NA, NA, "$17,513.00", ...
## $ monthly_price                                <chr> NA, NA, "$67,073.00", ...
## $ security_deposit                             <chr> "$0.00", "$5,000.00", ...
## $ cleaning_fee                                 <chr> "$33.00", "$1,100.00",...
## $ guests_included                              <dbl> 1, 3, 1, 1, 1, 2, 2, 2...
## $ extra_people                                 <chr> "$66.00", "$350.00", "...
## $ minimum_nights                               <dbl> 2, 3, 3, 7, 7, 2, 3, 6...
## $ maximum_nights                               <dbl> 15, 30, 31, 14, 31, 10...
## $ minimum_minimum_nights                       <dbl> 2, 3, 3, 3, 7, 2, 3, 6...
## $ maximum_minimum_nights                       <dbl> 2, 3, 3, 5, 7, 2, 3, 6...
## $ minimum_maximum_nights                       <dbl> 15, 30, 1125, 14, 1125...
## $ maximum_maximum_nights                       <dbl> 15, 30, 1125, 14, 1125...
## $ minimum_nights_avg_ntm                       <dbl> 2.0, 3.0, 3.0, 4.1, 7....
## $ maximum_nights_avg_ntm                       <dbl> 15, 30, 1125, 14, 1125...
## $ calendar_updated                             <chr> "5 months ago", "4 mon...
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30                              <dbl> 29, 28, 29, 21, 0, 0, ...
## $ availability_60                              <dbl> 59, 58, 59, 21, 0, 0, ...
## $ availability_90                              <dbl> 89, 88, 89, 21, 0, 0, ...
## $ availability_365                             <dbl> 89, 363, 172, 21, 0, 5...
## $ calendar_last_scraped                        <date> 2020-06-28, 2020-06-2...
## $ number_of_reviews                            <dbl> 168, 50, 293, 22, 90, ...
## $ number_of_reviews_ltm                        <dbl> 1, 4, 31, 2, 0, 0, 1, ...
## $ first_review                                 <date> 2009-09-04, 2013-12-0...
## $ last_review                                  <date> 2019-07-19, 2019-12-1...
## $ review_scores_rating                         <dbl> 96, 98, 91, 98, 94, 97...
## $ review_scores_accuracy                       <dbl> 10, 10, 10, 10, 10, 10...
## $ review_scores_cleanliness                    <dbl> 9, 10, 9, 10, 9, 10, 1...
## $ review_scores_checkin                        <dbl> 10, 10, 10, 10, 10, 10...
## $ review_scores_communication                  <dbl> 10, 10, 10, 10, 9, 10,...
## $ review_scores_location                       <dbl> 9, 10, 10, 10, 10, 10,...
## $ review_scores_value                          <dbl> 9, 10, 9, 10, 9, 9, 9,...
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, F...
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable                             <lgl> FALSE, FALSE, FALSE, F...
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy                          <chr> "moderate", "moderate"...
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, F...
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, F...
## $ calculated_host_listings_count               <dbl> 1, 1, 1, 1, 1, 1, 1, 1...
## $ calculated_host_listings_count_entire_homes  <dbl> 0, 1, 1, 1, 1, 1, 1, 1...
## $ calculated_host_listings_count_private_rooms <dbl> 1, 0, 0, 0, 0, 0, 0, 0...
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month                            <dbl> 1.28, 0.62, 2.48, 0.18...
(#tab:EDA_raw_glimpse)Data summary
Name listings
Number of rows 28523
Number of columns 106
_______________________
Column type frequency:
character 45
Date 5
logical 16
numeric 40
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 28523 0
name 57 1.00 1 211 0 26907 0
summary 1096 0.96 1 1000 0 26986 0
space 11390 0.60 1 1000 0 16806 0
description 515 0.98 1 1000 0 27748 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 12407 0.57 1 1000 0 15520 0
notes 20832 0.27 1 1000 0 7366 0
transit 11316 0.60 1 1000 0 16714 0
access 15751 0.45 1 1000 0 11251 0
interaction 14111 0.51 1 1000 0 13728 0
house_rules 13332 0.53 1 1000 0 13703 0
picture_url 0 1.00 81 146 0 28282 0
host_url 0 1.00 37 43 0 25745 0
host_name 12 1.00 1 34 0 6415 0
host_location 93 1.00 2 152 0 863 0
host_about 14028 0.51 1 3550 0 12478 34
host_response_time 11 1.00 3 18 0 5 0
host_response_rate 11 1.00 2 4 0 44 0
host_acceptance_rate 11 1.00 2 4 0 100 0
host_thumbnail_url 11 1.00 55 106 0 25669 0
host_picture_url 11 1.00 57 109 0 25669 0
host_neighbourhood 8007 0.72 1 21 0 56 0
host_verifications 0 1.00 2 156 0 316 0
street 0 1.00 10 61 0 528 0
neighbourhood 0 1.00 5 14 0 21 0
neighbourhood_cleansed 0 1.00 5 25 0 11 0
city 13 1.00 1 26 0 136 0
state 24042 0.16 1 25 0 170 0
market 864 0.97 6 21 0 10 0
smart_location 0 1.00 10 35 0 158 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 7 7 0 1 0
property_type 0 1.00 3 22 0 29 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1179 0 26634 0
price 0 1.00 5 10 0 611 0
weekly_price 25008 0.12 7 11 0 718 0
monthly_price 26971 0.05 7 11 0 476 0
security_deposit 13845 0.51 5 10 0 386 0
cleaning_fee 8968 0.69 5 9 0 430 0
extra_people 0 1.00 5 9 0 265 0
calendar_updated 0 1.00 5 13 0 82 0
cancellation_policy 0 1.00 8 27 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
host_since 11 1.00 2008-06-27 2020-06-26 2015-06-22 3379
calendar_last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
first_review 4968 0.83 2009-09-04 2020-06-28 2017-08-07 2582
last_review 4968 0.83 2011-08-10 2020-06-28 2019-07-29 1846

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 28523 0 NaN :
medium_url 28523 0 NaN :
xl_picture_url 28523 0 NaN :
host_is_superhost 11 1 0.10 FAL: 25610, TRU: 2902
host_has_profile_pic 11 1 1.00 TRU: 28445, FAL: 67
host_identity_verified 11 1 0.38 FAL: 17790, TRU: 10722
neighbourhood_group_cleansed 28523 0 NaN :
is_location_exact 0 1 0.79 TRU: 22507, FAL: 6016
has_availability 0 1 1.00 TRU: 28523
requires_license 0 1 0.00 FAL: 28523
license 28523 0 NaN :
jurisdiction_names 28523 0 NaN :
instant_bookable 0 1 0.29 FAL: 20360, TRU: 8163
is_business_travel_ready 0 1 0.00 FAL: 28523
require_guest_profile_picture 0 1 0.01 FAL: 28379, TRU: 144
require_guest_phone_verification 0 1 0.01 FAL: 28327, TRU: 196

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.04e+07 1.23e+07 6.98e+03 1.01e+07 1.95e+07 3.04e+07 4.40e+07 <U+2587><U+2587><U+2587><U+2585><U+2586>
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 <U+2581><U+2581><U+2587><U+2581><U+2581>
host_id 0 1.00 6.73e+07 7.62e+07 5.13e+02 1.28e+07 3.60e+07 9.46e+07 3.52e+08 <U+2587><U+2582><U+2581><U+2581><U+2581>
host_listings_count 11 1.00 5.46e+00 3.43e+01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 7.37e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
host_total_listings_count 11 1.00 5.46e+00 3.43e+01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 7.37e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
zipcode 815 0.97 2.07e+03 3.91e+02 2.00e+01 1.86e+03 2.20e+03 2.30e+03 2.10e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
latitude 0 1.00 5.57e+01 2.00e-02 5.56e+01 5.57e+01 5.57e+01 5.57e+01 5.57e+01 <U+2581><U+2583><U+2587><U+2587><U+2581>
longitude 0 1.00 1.26e+01 3.00e-02 1.24e+01 1.25e+01 1.26e+01 1.26e+01 1.26e+01 <U+2581><U+2582><U+2587><U+2585><U+2582>
accommodates 0 1.00 3.32e+00 1.63e+00 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.60e+01 <U+2587><U+2582><U+2581><U+2581><U+2581>
bathrooms 12 1.00 1.08e+00 2.80e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 29 1.00 1.55e+00 1.06e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 1.01e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 103 1.00 2.04e+00 1.44e+00 0.00e+00 1.00e+00 2.00e+00 3.00e+00 2.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
square_feet 28130 0.01 7.22e+02 5.76e+02 0.00e+00 1.20e+02 7.64e+02 1.08e+03 2.80e+03 <U+2587><U+2587><U+2583><U+2581><U+2581>
guests_included 0 1.00 1.52e+00 1.06e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 3.85e+00 1.81e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 6.21e+02 5.53e+02 1.00e+00 1.50e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_minimum_nights 0 1.00 3.84e+00 1.81e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_minimum_nights 0 1.00 4.12e+00 1.93e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_maximum_nights 0 1.00 6.67e+02 5.48e+02 1.00e+00 2.00e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_maximum_nights 0 1.00 6.70e+02 5.47e+02 1.00e+00 2.00e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights_avg_ntm 0 1.00 3.97e+00 1.84e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights_avg_ntm 0 1.00 6.68e+02 5.47e+02 1.00e+00 2.00e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_30 0 1.00 5.87e+00 1.04e+01 0.00e+00 0.00e+00 0.00e+00 8.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2582>
availability_60 0 1.00 1.12e+01 2.01e+01 0.00e+00 0.00e+00 0.00e+00 1.40e+01 6.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2582>
availability_90 0 1.00 1.65e+01 3.00e+01 0.00e+00 0.00e+00 0.00e+00 1.80e+01 9.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_365 0 1.00 4.95e+01 9.92e+01 0.00e+00 0.00e+00 0.00e+00 3.50e+01 3.65e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews 0 1.00 1.36e+01 2.71e+01 0.00e+00 1.00e+00 5.00e+00 1.50e+01 6.37e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 2.74e+00 6.64e+00 0.00e+00 0.00e+00 0.00e+00 3.00e+00 3.76e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 5447 0.81 9.52e+01 6.83e+00 2.00e+01 9.30e+01 9.70e+01 1.00e+02 1.00e+02 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 5468 0.81 9.73e+00 6.60e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 5466 0.81 9.38e+00 9.60e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 5488 0.81 9.82e+00 5.60e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 5470 0.81 9.86e+00 5.30e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 5491 0.81 9.60e+00 6.80e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 5495 0.81 9.45e+00 7.60e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
calculated_host_listings_count 0 1.00 4.45e+00 2.81e+01 1.00e+00 1.00e+00 1.00e+00 1.00e+00 2.81e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_entire_homes 0 1.00 4.17e+00 2.81e+01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 2.81e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_private_rooms 0 1.00 2.60e-01 7.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.20e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_shared_rooms 0 1.00 1.00e-02 3.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.30e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
reviews_per_month 4968 0.83 4.90e-01 7.30e-01 1.00e-02 1.20e-01 2.80e-01 5.90e-01 3.06e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>

Using glimpse() I find that there 28,523 observations (rows) across 106 variables (columns). I can already detect that some data types are assigned wrongly (for example the host_acceptance_rate is assigned as character, but should be numeric, double). I bear this in mind, alongside an awareness of other potential issues with the data to be addressed as I proceed.

When I skim, I make 4 prominent observations:

Data Wrangling

I found out that a lot of variables don’t have the correct data type assigned. Therefore, I change the data types using as.factor and parse_number. Subsequently, I verify the conversion with typeof().

Recoding Variables

listings_cleaned <- listings %>% 
  
  mutate(
    #I recode some variables to factors 
    host_response_time = as.factor(host_response_time), 
    host_neighbourhood = as.factor(host_neighbourhood),
    neighbourhood = as.factor(neighbourhood),
    neighbourhood_cleansed = as.factor(neighbourhood_cleansed),
    property_type = as.factor(property_type),
    room_type = as.factor(room_type),
    bed_type = as.factor(bed_type),
    cancellation_policy = as.factor(cancellation_policy),
    
   #I recode some variables to numeric 
    host_response_rate = parse_number(host_response_rate), #I use parse_number as some of the variables have dollar signs or % signs, and thus as.numeric would create all N/A's for these values
    host_acceptance_rate = parse_number(host_acceptance_rate),
    price = parse_number(price),
    weekly_price = parse_number(weekly_price),
    monthly_price = parse_number(monthly_price),
    security_deposit = parse_number(security_deposit),
    cleaning_fee = parse_number(cleaning_fee),
    extra_people = parse_number(extra_people)
    )

#I now verify the conversion of data types for the above parsed variables 
typeof(listings_cleaned$host_response_rate)
typeof(listings_cleaned$host_acceptance_rate)
typeof(listings_cleaned$price)
typeof(listings_cleaned$weekly_price)
typeof(listings_cleaned$cleaning_fee)
typeof(listings_cleaned$host_response_rate)
typeof(listings_cleaned$extra_people)

#I find that in all cases the conversion has been successful, since they are all now of type double

I now inspect the new dataset:

(#tab:skim_data_cleaned)Data summary
Name listings_cleaned
Number of rows 28523
Number of columns 106
_______________________
Column type frequency:
character 29
Date 5
factor 8
logical 16
numeric 48
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 28523 0
name 57 1.00 1 211 0 26907 0
summary 1096 0.96 1 1000 0 26986 0
space 11390 0.60 1 1000 0 16806 0
description 515 0.98 1 1000 0 27748 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 12407 0.57 1 1000 0 15520 0
notes 20832 0.27 1 1000 0 7366 0
transit 11316 0.60 1 1000 0 16714 0
access 15751 0.45 1 1000 0 11251 0
interaction 14111 0.51 1 1000 0 13728 0
house_rules 13332 0.53 1 1000 0 13703 0
picture_url 0 1.00 81 146 0 28282 0
host_url 0 1.00 37 43 0 25745 0
host_name 12 1.00 1 34 0 6415 0
host_location 93 1.00 2 152 0 863 0
host_about 14028 0.51 1 3550 0 12478 34
host_thumbnail_url 11 1.00 55 106 0 25669 0
host_picture_url 11 1.00 57 109 0 25669 0
host_verifications 0 1.00 2 156 0 316 0
street 0 1.00 10 61 0 528 0
city 13 1.00 1 26 0 136 0
state 24042 0.16 1 25 0 170 0
market 864 0.97 6 21 0 10 0
smart_location 0 1.00 10 35 0 158 0
country_code 0 1.00 2 2 0 1 0
country 0 1.00 7 7 0 1 0
amenities 0 1.00 2 1179 0 26634 0
calendar_updated 0 1.00 5 13 0 82 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
host_since 11 1.00 2008-06-27 2020-06-26 2015-06-22 3379
calendar_last_scraped 0 1.00 2020-06-26 2020-06-30 2020-06-27 5
first_review 4968 0.83 2009-09-04 2020-06-28 2017-08-07 2582
last_review 4968 0.83 2011-08-10 2020-06-28 2019-07-29 1846

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
host_response_time 11 1.00 FALSE 5 N/A: 22887, wit: 2371, wit: 1690, wit: 1151
host_neighbourhood 8007 0.72 FALSE 56 Nør: 3876, Ves: 2919, Fre: 2399, Ind: 2375
neighbourhood 0 1.00 FALSE 21 Nør: 5176, Ves: 4279, Ind: 3484, Fre: 3343
neighbourhood_cleansed 0 1.00 FALSE 11 Nrr: 5166, Ves: 4447, Ind: 3784, Fre: 3354
property_type 0 1.00 FALSE 29 Apa: 23942, Con: 1673, Hou: 1365, Tow: 554
room_type 0 1.00 FALSE 4 Ent: 24054, Pri: 4353, Sha: 78, Hot: 38
bed_type 0 1.00 FALSE 5 Rea: 28139, Pul: 207, Fut: 94, Cou: 46
cancellation_policy 0 1.00 FALSE 4 fle: 12435, mod: 8919, str: 6888, sup: 281

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 28523 0 NaN :
medium_url 28523 0 NaN :
xl_picture_url 28523 0 NaN :
host_is_superhost 11 1 0.10 FAL: 25610, TRU: 2902
host_has_profile_pic 11 1 1.00 TRU: 28445, FAL: 67
host_identity_verified 11 1 0.38 FAL: 17790, TRU: 10722
neighbourhood_group_cleansed 28523 0 NaN :
is_location_exact 0 1 0.79 TRU: 22507, FAL: 6016
has_availability 0 1 1.00 TRU: 28523
requires_license 0 1 0.00 FAL: 28523
license 28523 0 NaN :
jurisdiction_names 28523 0 NaN :
instant_bookable 0 1 0.29 FAL: 20360, TRU: 8163
is_business_travel_ready 0 1 0.00 FAL: 28523
require_guest_profile_picture 0 1 0.01 FAL: 28379, TRU: 144
require_guest_phone_verification 0 1 0.01 FAL: 28327, TRU: 196

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.04e+07 1.23e+07 6.98e+03 1.01e+07 1.95e+07 3.04e+07 4.40e+07 <U+2587><U+2587><U+2587><U+2585><U+2586>
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 <U+2581><U+2581><U+2587><U+2581><U+2581>
host_id 0 1.00 6.73e+07 7.62e+07 5.13e+02 1.28e+07 3.60e+07 9.46e+07 3.52e+08 <U+2587><U+2582><U+2581><U+2581><U+2581>
host_response_rate 22898 0.20 8.98e+01 2.65e+01 0.00e+00 1.00e+02 1.00e+02 1.00e+02 1.00e+02 <U+2581><U+2581><U+2581><U+2581><U+2587>
host_acceptance_rate 12447 0.56 7.36e+01 3.24e+01 0.00e+00 5.00e+01 8.90e+01 1.00e+02 1.00e+02 <U+2582><U+2581><U+2582><U+2582><U+2587>
host_listings_count 11 1.00 5.46e+00 3.43e+01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 7.37e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
host_total_listings_count 11 1.00 5.46e+00 3.43e+01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 7.37e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
zipcode 815 0.97 2.07e+03 3.91e+02 2.00e+01 1.86e+03 2.20e+03 2.30e+03 2.10e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
latitude 0 1.00 5.57e+01 2.00e-02 5.56e+01 5.57e+01 5.57e+01 5.57e+01 5.57e+01 <U+2581><U+2583><U+2587><U+2587><U+2581>
longitude 0 1.00 1.26e+01 3.00e-02 1.24e+01 1.25e+01 1.26e+01 1.26e+01 1.26e+01 <U+2581><U+2582><U+2587><U+2585><U+2582>
accommodates 0 1.00 3.32e+00 1.63e+00 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.60e+01 <U+2587><U+2582><U+2581><U+2581><U+2581>
bathrooms 12 1.00 1.08e+00 2.80e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 29 1.00 1.55e+00 1.06e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 1.01e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 103 1.00 2.04e+00 1.44e+00 0.00e+00 1.00e+00 2.00e+00 3.00e+00 2.50e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
square_feet 28130 0.01 7.22e+02 5.76e+02 0.00e+00 1.20e+02 7.64e+02 1.08e+03 2.80e+03 <U+2587><U+2587><U+2583><U+2581><U+2581>
price 0 1.00 8.49e+02 1.07e+03 0.00e+00 4.98e+02 6.98e+02 9.96e+02 6.92e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
weekly_price 25008 0.12 4.90e+03 4.94e+03 1.50e+02 3.00e+03 4.02e+03 5.50e+03 1.49e+05 <U+2587><U+2581><U+2581><U+2581><U+2581>
monthly_price 26971 0.05 1.67e+04 1.88e+04 4.75e+02 1.00e+04 1.39e+04 2.00e+04 3.73e+05 <U+2587><U+2581><U+2581><U+2581><U+2581>
security_deposit 13845 0.51 1.45e+03 3.29e+03 0.00e+00 0.00e+00 0.00e+00 1.50e+03 3.52e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
cleaning_fee 8968 0.69 3.07e+02 2.48e+02 0.00e+00 1.50e+02 2.50e+02 4.00e+02 4.00e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
guests_included 0 1.00 1.52e+00 1.06e+00 1.00e+00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
extra_people 0 1.00 6.39e+01 1.26e+02 0.00e+00 0.00e+00 0.00e+00 1.00e+02 2.02e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 3.85e+00 1.81e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 6.21e+02 5.53e+02 1.00e+00 1.50e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_minimum_nights 0 1.00 3.84e+00 1.81e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_minimum_nights 0 1.00 4.12e+00 1.93e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_maximum_nights 0 1.00 6.67e+02 5.48e+02 1.00e+00 2.00e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_maximum_nights 0 1.00 6.70e+02 5.47e+02 1.00e+00 2.00e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights_avg_ntm 0 1.00 3.97e+00 1.84e+01 1.00e+00 2.00e+00 3.00e+00 4.00e+00 1.10e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights_avg_ntm 0 1.00 6.68e+02 5.47e+02 1.00e+00 2.00e+01 1.12e+03 1.12e+03 1.00e+04 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_30 0 1.00 5.87e+00 1.04e+01 0.00e+00 0.00e+00 0.00e+00 8.00e+00 3.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2582>
availability_60 0 1.00 1.12e+01 2.01e+01 0.00e+00 0.00e+00 0.00e+00 1.40e+01 6.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2582>
availability_90 0 1.00 1.65e+01 3.00e+01 0.00e+00 0.00e+00 0.00e+00 1.80e+01 9.00e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_365 0 1.00 4.95e+01 9.92e+01 0.00e+00 0.00e+00 0.00e+00 3.50e+01 3.65e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews 0 1.00 1.36e+01 2.71e+01 0.00e+00 1.00e+00 5.00e+00 1.50e+01 6.37e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 2.74e+00 6.64e+00 0.00e+00 0.00e+00 0.00e+00 3.00e+00 3.76e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 5447 0.81 9.52e+01 6.83e+00 2.00e+01 9.30e+01 9.70e+01 1.00e+02 1.00e+02 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 5468 0.81 9.73e+00 6.60e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 5466 0.81 9.38e+00 9.60e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 5488 0.81 9.82e+00 5.60e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 5470 0.81 9.86e+00 5.30e-01 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 5491 0.81 9.60e+00 6.80e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 5495 0.81 9.45e+00 7.60e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 <U+2581><U+2581><U+2581><U+2581><U+2587>
calculated_host_listings_count 0 1.00 4.45e+00 2.81e+01 1.00e+00 1.00e+00 1.00e+00 1.00e+00 2.81e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_entire_homes 0 1.00 4.17e+00 2.81e+01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 2.81e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_private_rooms 0 1.00 2.60e-01 7.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.20e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_shared_rooms 0 1.00 1.00e-02 3.30e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.30e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
reviews_per_month 4968 0.83 4.90e-01 7.30e-01 1.00e-02 1.20e-01 2.80e-01 5.90e-01 3.06e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>

I compare the newly created dataset, listings_cleaned, to the original dataset, listings, and detect that the count of N/A’s seem to vary for host_response_rate and host_response_time.

To understand the reason, I inspect one variable in depth: host_response_rate. In order to do this, I print the unique values of host_response_rate for both datasets and then skim the variable characteristics in each dataset.

#I look at the variable host_response_rate in depth
unique(listings$host_response_rate) #levels of host_response_rate in listings
##  [1] "N/A"  "100%" "67%"  "80%"  "50%"  "83%"  "0%"   "82%"  "64%"  "71%" 
## [11] "33%"  "75%"  "90%"  NA     "60%"  "88%"  "78%"  "86%"  "89%"  "79%" 
## [21] "77%"  "25%"  "40%"  "91%"  "20%"  "14%"  "17%"  "29%"  "70%"  "94%" 
## [31] "43%"  "63%"  "87%"  "93%"  "97%"  "56%"  "38%"  "99%"  "57%"  "44%" 
## [41] "92%"  "30%"  "36%"  "10%"  "59%"
unique(listings_cleaned$host_response_rate) #levels of host_response_rate in listings_cleaned
##  [1]  NA 100  67  80  50  83   0  82  64  71  33  75  90  60  88  78  86  89  79
## [20]  77  25  40  91  20  14  17  29  70  94  43  63  87  93  97  56  38  99  57
## [39]  44  92  30  36  10  59
skim(listings$host_response_rate)
(#tab:checking_listings_cleaned)Data summary
Name listings$host_response_ra…
Number of rows 28523
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
data 11 1 2 4 0 44 0
skim(listings_cleaned$host_response_rate)
(#tab:checking_listings_cleaned)Data summary
Name listings_cleaned$host_res…
Number of rows 28523
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 22898 0.2 89.8 26.5 0 100 100 100 100 <U+2581><U+2581><U+2581><U+2581><U+2587>

I find that in the dataset ‘listings’ I have two different N/A values: “N/A” and NA, while in the ‘listings_cleaned’ dataset, both are grouped under NA. Moreover, I detect that in the count of n_missing there is a difference of 22887. Let’s count the number of “N/A” observations in ‘listings’, apart from the 11 NA.

listings%>% 
  filter(host_response_rate=="N/A") %>% 
  summarise(count=n())

count
22887
The observations in listings that had “N/A” as a value are exactly 22887 and this explains the difference in the number of NA’s between the unclean and cleaned dataset. The conversion of data type using parse_number() converted “N/A” into NA. The variable ‘host_response_rate’ is a similar case, where there are hidden missing values in the unclean dataset.

Now I can be confident that my transformation of variables has been accurate and has not created unintended NAs, I move onto selecting only the most prominent variables for my subsequent analyses.

Selecting Variables

Before I generate summary statistics and visualizations, I select relevant variables as I do not need all 106. To be able to easily refer to the variables, I print a list of all variables with their corresponding column number, which I use as an identifier:

variable_codes <- 
  c(colnames(
    listings_cleaned)
    )

print(variable_codes)
##   [1] "id"                                          
##   [2] "listing_url"                                 
##   [3] "scrape_id"                                   
##   [4] "last_scraped"                                
##   [5] "name"                                        
##   [6] "summary"                                     
##   [7] "space"                                       
##   [8] "description"                                 
##   [9] "experiences_offered"                         
##  [10] "neighborhood_overview"                       
##  [11] "notes"                                       
##  [12] "transit"                                     
##  [13] "access"                                      
##  [14] "interaction"                                 
##  [15] "house_rules"                                 
##  [16] "thumbnail_url"                               
##  [17] "medium_url"                                  
##  [18] "picture_url"                                 
##  [19] "xl_picture_url"                              
##  [20] "host_id"                                     
##  [21] "host_url"                                    
##  [22] "host_name"                                   
##  [23] "host_since"                                  
##  [24] "host_location"                               
##  [25] "host_about"                                  
##  [26] "host_response_time"                          
##  [27] "host_response_rate"                          
##  [28] "host_acceptance_rate"                        
##  [29] "host_is_superhost"                           
##  [30] "host_thumbnail_url"                          
##  [31] "host_picture_url"                            
##  [32] "host_neighbourhood"                          
##  [33] "host_listings_count"                         
##  [34] "host_total_listings_count"                   
##  [35] "host_verifications"                          
##  [36] "host_has_profile_pic"                        
##  [37] "host_identity_verified"                      
##  [38] "street"                                      
##  [39] "neighbourhood"                               
##  [40] "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                
##  [42] "city"                                        
##  [43] "state"                                       
##  [44] "zipcode"                                     
##  [45] "market"                                      
##  [46] "smart_location"                              
##  [47] "country_code"                                
##  [48] "country"                                     
##  [49] "latitude"                                    
##  [50] "longitude"                                   
##  [51] "is_location_exact"                           
##  [52] "property_type"                               
##  [53] "room_type"                                   
##  [54] "accommodates"                                
##  [55] "bathrooms"                                   
##  [56] "bedrooms"                                    
##  [57] "beds"                                        
##  [58] "bed_type"                                    
##  [59] "amenities"                                   
##  [60] "square_feet"                                 
##  [61] "price"                                       
##  [62] "weekly_price"                                
##  [63] "monthly_price"                               
##  [64] "security_deposit"                            
##  [65] "cleaning_fee"                                
##  [66] "guests_included"                             
##  [67] "extra_people"                                
##  [68] "minimum_nights"                              
##  [69] "maximum_nights"                              
##  [70] "minimum_minimum_nights"                      
##  [71] "maximum_minimum_nights"                      
##  [72] "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                      
##  [74] "minimum_nights_avg_ntm"                      
##  [75] "maximum_nights_avg_ntm"                      
##  [76] "calendar_updated"                            
##  [77] "has_availability"                            
##  [78] "availability_30"                             
##  [79] "availability_60"                             
##  [80] "availability_90"                             
##  [81] "availability_365"                            
##  [82] "calendar_last_scraped"                       
##  [83] "number_of_reviews"                           
##  [84] "number_of_reviews_ltm"                       
##  [85] "first_review"                                
##  [86] "last_review"                                 
##  [87] "review_scores_rating"                        
##  [88] "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                   
##  [90] "review_scores_checkin"                       
##  [91] "review_scores_communication"                 
##  [92] "review_scores_location"                      
##  [93] "review_scores_value"                         
##  [94] "requires_license"                            
##  [95] "license"                                     
##  [96] "jurisdiction_names"                          
##  [97] "instant_bookable"                            
##  [98] "is_business_travel_ready"                    
##  [99] "cancellation_policy"                         
## [100] "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"            
## [102] "calculated_host_listings_count"              
## [103] "calculated_host_listings_count_entire_homes" 
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" 
## [106] "reviews_per_month"

Now, I can select the variables that I believe will truly be relevant to my subsequent analyses:

#I subsequently select relevant variables on the basis of their column numbers in the listings_cleaned dataframe

#the listings_necessary dataframe contains all the variables and data with which I will proceed: below I provide a brief annotation of which variables I chose 
listings_necessary <- 
  listings_cleaned %>% 
  select(c(2, #listings url
           #5:10, #if I than can use keywords to filter relevant information 
            15, #house rules
            23, #host since
            27:29, #host info: host_response_rate, host_acceptance_rate, host_is_superhost                   
            33, #host listings count
            35:37, #host info: host_verifications, host_has_profile_pic, host_identity_verified
            39:41, #neighborhood: neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed
            44, #zipcode
           49:50,#latitude, longitude
           51:61, #flat info: is_location_exact, property_type, room_type, accommodates, bathrooms, bedrooms, beds, amenities, square_feet, price
            64:68, # security_deposit, leaning_fee, guests_included, extra_people, minimum_nights
            83:93, #number_of_reviews. number_of_reviews_ltm", "first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communicatio, review_scores_location. review_scores_value"
            97, #instant_bookable
            98, #is_business_travel_ready
            99, #cancellation_policy
            106) #reviews_per_month
         )

Now that I have created listings_necessary, I will be working only with the columns (variables) which I think could be relevant to my subsequent analyses.Therefore, variables such as the host names, host ID and descriptive variables are not in the selected list, since they do not provide any meaningful insights. However, I have not yet reduced the variables down only to those I intend to use - there are many variables here which are likely to be redundant in my final analyses, but which I must explore further to decide whether to use in their present form, transform further into a useful form, or eliminate altogether.

Before I make further evaluations of which variables matter most to my analyses, I address missingness, which is a prominent issue for a small number of my chosen variables.

Adjusting Missing Values

Identifying Missing Values

I ask the question: “are there any missing values?”, as I review the transformed data with the skim() function.

# Looking at the transformed data to find out variables with missing values
skim(listings_necessary)
(#tab:identifying_missing_values)Data summary
Name listings_necessary
Number of rows 28523
Number of columns 47
_______________________
Column type frequency:
character 4
Date 3
factor 6
logical 7
numeric 27
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 28523 0
house_rules 13332 0.53 1 1000 0 13703 0
host_verifications 0 1.00 2 156 0 316 0
amenities 0 1.00 2 1179 0 26634 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 11 1.00 2008-06-27 2020-06-26 2015-06-22 3379
first_review 4968 0.83 2009-09-04 2020-06-28 2017-08-07 2582
last_review 4968 0.83 2011-08-10 2020-06-28 2019-07-29 1846

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
neighbourhood 0 1 FALSE 21 Nør: 5176, Ves: 4279, Ind: 3484, Fre: 3343
neighbourhood_cleansed 0 1 FALSE 11 Nrr: 5166, Ves: 4447, Ind: 3784, Fre: 3354
property_type 0 1 FALSE 29 Apa: 23942, Con: 1673, Hou: 1365, Tow: 554
room_type 0 1 FALSE 4 Ent: 24054, Pri: 4353, Sha: 78, Hot: 38
bed_type 0 1 FALSE 5 Rea: 28139, Pul: 207, Fut: 94, Cou: 46
cancellation_policy 0 1 FALSE 4 fle: 12435, mod: 8919, str: 6888, sup: 281

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 11 1 0.10 FAL: 25610, TRU: 2902
host_has_profile_pic 11 1 1.00 TRU: 28445, FAL: 67
host_identity_verified 11 1 0.38 FAL: 17790, TRU: 10722
neighbourhood_group_cleansed 28523 0 NaN :
is_location_exact 0 1 0.79 TRU: 22507, FAL: 6016
instant_bookable 0 1 0.29 FAL: 20360, TRU: 8163
is_business_travel_ready 0 1 0.00 FAL: 28523

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
host_response_rate 22898 0.20 89.75 26.47 0.00 100.00 100.00 100.00 100.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
host_acceptance_rate 12447 0.56 73.56 32.40 0.00 50.00 89.00 100.00 100.0 <U+2582><U+2581><U+2582><U+2582><U+2587>
host_listings_count 11 1.00 5.46 34.32 0.00 1.00 1.00 1.00 737.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
zipcode 815 0.97 2069.76 390.81 20.00 1855.75 2200.00 2300.00 21000.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
latitude 0 1.00 55.68 0.02 55.62 55.67 55.68 55.70 55.7 <U+2581><U+2583><U+2587><U+2587><U+2581>
longitude 0 1.00 12.56 0.03 12.45 12.54 12.55 12.58 12.6 <U+2581><U+2582><U+2587><U+2585><U+2582>
accommodates 0 1.00 3.32 1.63 1.00 2.00 3.00 4.00 16.0 <U+2587><U+2582><U+2581><U+2581><U+2581>
bathrooms 12 1.00 1.08 0.28 0.00 1.00 1.00 1.00 10.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 29 1.00 1.55 1.06 0.00 1.00 1.00 2.00 101.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 103 1.00 2.04 1.44 0.00 1.00 2.00 3.00 25.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
square_feet 28130 0.01 721.89 576.10 0.00 120.00 764.00 1076.00 2799.0 <U+2587><U+2587><U+2583><U+2581><U+2581>
price 0 1.00 849.19 1066.81 0.00 498.00 698.00 996.00 69175.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
security_deposit 13845 0.51 1448.66 3292.68 0.00 0.00 0.00 1500.00 35198.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
cleaning_fee 8968 0.69 306.88 247.65 0.00 150.00 250.00 400.00 4000.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
guests_included 0 1.00 1.52 1.06 1.00 1.00 1.00 2.00 16.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
extra_people 0 1.00 63.88 126.23 0.00 0.00 0.00 100.00 2024.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 3.85 18.07 1.00 2.00 3.00 4.00 1100.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews 0 1.00 13.57 27.09 0.00 1.00 5.00 15.00 637.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 2.74 6.64 0.00 0.00 0.00 3.00 376.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 5447 0.81 95.16 6.83 20.00 93.00 97.00 100.00 100.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 5468 0.81 9.73 0.66 2.00 10.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 5466 0.81 9.38 0.96 2.00 9.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 5488 0.81 9.82 0.56 2.00 10.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 5470 0.81 9.86 0.53 2.00 10.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 5491 0.81 9.60 0.68 2.00 9.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 5495 0.81 9.45 0.76 2.00 9.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
reviews_per_month 4968 0.83 0.49 0.73 0.01 0.12 0.28 0.59 30.6 <U+2587><U+2581><U+2581><U+2581><U+2581>

I identify a few key variables with substantial missingness, including cleaning_fee, security_deposit, bathrooms, bedrooms, beds, and review scores, among others. I address these by asking ourselves (and investigating in the chunk below):

  • Which variables have problematic missingness?
  • How many variables have missing values?
  • What does this missingness indicate, and therefore how should I address it?

Addressing Missing Values

#I find that the variables "host_response_rate","weekly_price","monthly_price" and "host_response_time" have more than 80% data missing, hence I do not impute them, since this would most likely introduce errors into my analyses

# missing value of cleaning_fee indicates there wasn't any cleaning fee and hence I impute it by 0
listings_necessary <- listings_necessary %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee
  ))

# missing value of security_deposit indicates there wasn't any security deposit and hence I impute it by 0
listings_necessary <- listings_necessary %>%
  mutate(security_deposit = case_when(
    is.na(security_deposit) ~ 0, 
    TRUE ~ security_deposit
  ))

# missing value of bathrooms indicates there wasn't any bathroom and hence I impute it by 0
listings_necessary <- listings_necessary %>%
  mutate(bathrooms = case_when(
    is.na(bathrooms) ~ 0, 
    TRUE ~ bathrooms
  ))
# missing value of bedrooms indicates there wasn't any bedroom and hence I impute it by 0. I have values like "Shared room", "Hotel room" and "Private room" which may indicate ambiguity on their status as a bedroom
listings_necessary <- listings_necessary %>%
  mutate(bedrooms = case_when(
    is.na(bedrooms) ~ 0, 
    TRUE ~ bedrooms
  ))

# If it's not mentioned then it's possible there is no bed at all. It is also possible that there is a "sofa" or an "airbed" in place which are the values of the variable bed_type
listings_necessary <- listings_necessary %>%
  mutate(beds = case_when(
    is.na(beds) ~ 0, 
    TRUE ~ beds
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(review_scores_rating  = case_when(
    is.na(review_scores_rating ) ~ median(review_scores_rating,na.rm = TRUE), 
    TRUE ~ review_scores_rating 
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(review_scores_accuracy = case_when(
    is.na(review_scores_accuracy) ~ median(review_scores_accuracy, na.rm = TRUE), 
    TRUE ~ review_scores_accuracy
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(review_scores_cleanliness  = case_when(
    is.na(review_scores_cleanliness) ~ median(review_scores_cleanliness, na.rm = TRUE), 
    TRUE ~ review_scores_cleanliness 
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(review_scores_checkin   = case_when(
    is.na(review_scores_checkin) ~ median(review_scores_checkin,na.rm = TRUE), 
    TRUE ~ review_scores_checkin  
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(review_scores_communication   = case_when(
    is.na(review_scores_communication) ~ median(review_scores_communication,na.rm = TRUE), 
    TRUE ~ review_scores_communication  
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(review_scores_location    = case_when(
    is.na(review_scores_location) ~ median(review_scores_location, na.rm = TRUE), 
    TRUE ~ review_scores_location   
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(review_scores_value     = case_when(
    is.na(review_scores_value) ~ median(review_scores_value, na.rm = TRUE ), 
    TRUE ~ review_scores_value    
  ))


# These are probably new listings which do not have a rating yet. I impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
  mutate(reviews_per_month     = case_when(
    is.na(reviews_per_month) ~ median(reviews_per_month, na.rm = TRUE), 
    TRUE ~ reviews_per_month    
  ))


# The following variables have 11 datapoints which are simultaneously missing. This maybe caused due to some issues in data storage, however I will follow a conservative approach in imputing these, which is undertaken as follows:


# If the host was a Superhost, his data wouldn't be missing and hence I impute it by FALSE
listings_necessary <- listings_necessary %>%
  mutate(host_is_superhost     = case_when(
    is.na(host_is_superhost) ~ FALSE, 
    TRUE ~ host_is_superhost    
  ))


# I impute host_has_profile_pic by FALSE following my conservative approach and previous logic for Superhosts: if it were true, it would be true - no response means there is most likely no profile picture
listings_necessary <- listings_necessary %>%
  mutate(host_has_profile_pic     = case_when(
    is.na(host_has_profile_pic) ~ FALSE, 
    TRUE ~ host_has_profile_pic    
  ))


# I impute host_identity_verified by FALSE following my conservative approach and previous logic for Superhosts: if it were true, it would be true - no response means there is most likely no verification
listings_necessary <- listings_necessary %>%
  mutate(host_identity_verified     = case_when(
    is.na(host_identity_verified) ~ FALSE, 
    TRUE ~ host_identity_verified    
  ))
# These are probably new listings which do not have their listings recorded. I impute it by a central value (Median) 
listings_necessary <- listings_necessary %>%
  mutate(host_listings_count     = case_when(
    is.na(host_listings_count) ~ median(host_listings_count, na.rm =TRUE), 
    TRUE ~ host_listings_count    
  ))

Modifying and Creating Variables

Creating prop_type_simplified

Further, I have certain variables, including property_type that have a known set of values but a skewed frequency distribution. In order to clean these, I will build factor variables by grouping the values of lower frequency into one category and keeping the values with higher frequency as categories. I perform these operations in the chunk below:

#I now make factor variables from variables having a known set of possible values but only a few values with high frequency 

# first, I look at frequencies of different values of property_type
count_property_type <- listings_necessary %>% 
  count(property_type) %>%
  arrange(desc(n))


# next, I use this to produce a tabular Output for Frequency Distribution of property_type
kbl(count_property_type, col.names=c("Values of property type","Frequency")) %>% 
  kable_styling()
Values of property type Frequency
Apartment 23942
Condominium 1673
House 1365
Townhouse 554
Serviced apartment 331
Loft 231
Villa 190
Hostel 28
Guesthouse 23
Bungalow 22
Guest suite 22
Bed and breakfast 21
Houseboat 21
Boat 20
Hotel 20
Other 14
Tiny house 14
Cabin 10
Boutique hotel 5
Cottage 4
Hut 3
Aparthotel 2
Barn 2
Camper/RV 1
Casa particular (Cuba) 1
Earth house 1
Island 1
Lighthouse 1
Tent 1

I find that the 4 most common property types are Apartment, Condominium, House and Townhouse (in that order). I now check the percentage of the total data points that the top 4 most frequent values of property types constitute:

(sum(count_property_type[1:4,"n"])/sum(count_property_type[,"n"]))*100
## [1] 96.5

The 4 most common property types make up 96.5% of the total listings in my data, which is certainly high enough to use a simplified version of property type. I select the top 4 values and bundle the remaining ones into “Other”.

#I now select the top 4 values as categories and bundle everything else in "Other"
listings_necessary <- listings_necessary %>%
  mutate(prop_type_simplified = case_when(
    as.character(property_type) %in% c("Apartment","Condominium", "House","Townhouse") ~ as.character(property_type), 
    TRUE ~ "Others"
  ))

#I now check for the correct creation of prop_type_simplified. I look at the frequency distribution and find out that the new frequency distribution is consistent with the older one 
prop_type_creation_check <- listings_necessary %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n)) 

#as a result of this output, I can be confident that prop_type_simplified was correctly made

#I also provide a tabular output of the above classification check
kbl(prop_type_creation_check,col.names=c("Property Type","Property Type Simplified","Number of Listings")) %>% 
  kable_styling()
Property Type Property Type Simplified Number of Listings
Apartment Apartment 23942
Condominium Condominium 1673
House House 1365
Townhouse Townhouse 554
Serviced apartment Others 331
Loft Others 231
Villa Others 190
Hostel Others 28
Guesthouse Others 23
Bungalow Others 22
Guest suite Others 22
Bed and breakfast Others 21
Houseboat Others 21
Boat Others 20
Hotel Others 20
Other Others 14
Tiny house Others 14
Cabin Others 10
Boutique hotel Others 5
Cottage Others 4
Hut Others 3
Aparthotel Others 2
Barn Others 2
Camper/RV Others 1
Casa particular (Cuba) Others 1
Earth house Others 1
Island Others 1
Lighthouse Others 1
Tent Others 1

Creating Other Variables

Creating accommodates_simplified

I apply the same reasoning from my recoding of property_type into property_type_simplified to create other simplified variables which will be of greater use in my subsequent analyses.

I create the new variable acommodates_simplified. To do so I first look at the distribution of accommodates:

#I evaluate the frequencies of the different levels within the accommodates variable (the Number of people a listing can accommodate)
count_accommodation_limit <- listings_necessary %>% 
  count(accommodates) %>%
  arrange(desc(n))

#next, I use this to produce a tabular Output for Frequency Distribution of accommodates
kbl(count_accommodation_limit, col.names=c("Number of People Listing Accommodates","Number of Such Listings")) %>% 
  kable_styling()
Number of People Listing Accommodates Number of Such Listings
2 11649
4 6828
3 3842
5 2137
6 2089
1 967
8 434
7 346
10 94
9 88
12 26
11 10
16 10
14 2
15 1

I find that the distribution is evidently right (positively) skewed, so I can group greater observations to one value I did before. Therefore, I keep the levels 1-6 and group all levels >7 into 7 in order to transform accommodates into accommodates_simplified:

#I keep levels 2,4,3,5,6,1 and group number of accommodates greater than 7 into 7
listings_necessary <- listings_necessary %>%
  mutate(accommodates_simplified = case_when(
    accommodates %in% c(2,4,3,5,6,1) ~ accommodates, 
    TRUE ~ 7
  ))

#I check for the correct creation of accommodates_simplified. I look at the frequency distribution and find out that the new frequency distribution is consistent with the older one 
accommodates_simplified_creation_check <- listings_necessary %>%
  count(accommodates, accommodates_simplified) %>%
  arrange(desc(n)) 

#as a result of this output, I can be confident that prop_type_simplified was correctly made

#I also provide a tabular output of the above classification check
kbl(accommodates_simplified_creation_check,col.names=c("Accommodates","Accommodates Simplified","Number of Listings")) %>% 
  kable_styling()
Accommodates Accommodates Simplified Number of Listings
2 2 11649
4 4 6828
3 3 3842
5 5 2137
6 6 2089
1 1 967
8 7 434
7 7 346
10 7 94
9 7 88
12 7 26
11 7 10
16 7 10
14 7 2
15 7 1

Changing the data type of prop_type_simplified and accommodates_simplified

Now, I change the data type of the two newly created variables, making factor variables (as they have a known set of possible values but only a few values with high frequency):

#since both of these variables are of type character, I convert them to factor variables
listings_necessary <- listings_necessary %>% 
  
  mutate(
    #I recode these columns to factors 
    accommodates_simplified = as.numeric(accommodates_simplified), 
    prop_type_simplified = as.factor(prop_type_simplified)
  )

I find that none of the listings is ready for business travel, and thus can keep all listings as leisure listings.

Putting Bedrooms, Bathrooms and Beds in relation to accommodates_simplified

Bedrooms, bathrooms and beds are more insightful if I put them into relation to the accommodates. I do this by creating new variables:

#bedrooms_per_capacity = bedrooms / accommodates. This represents the number of bedrooms available to one person, given that the maximum accommodation possible in the property is "accommodates"
listings_necessary <- listings_necessary %>% 
  mutate(
    bedrooms_per_capacity = bedrooms/accommodates
  )

#beds_per_capacity = beds / accommodates. This represents the number of beds available to one person, given that the maximum accommodation possible in the property is "accommodates"
listings_necessary <- listings_necessary %>% 
  mutate(
    beds_per_capacity = beds/accommodates
  )

#bathrooms_per_capacity = bathrooms / accommodates. This represents the number of bathrooms available to one person, given that the maximum accommodation possible in the property is "accommodates"
listings_necessary <- listings_necessary %>% 
  mutate(
    bathrooms_per_capacity = bathrooms/accommodates
  )

Evaluating minimum_nights Stay

Now, I want to find the most common value for minimum nights:

#here I find out the most common value for minimum mights
count_minimum_nights <- listings_necessary %>% 
  count(minimum_nights) %>%
  arrange(desc(n))

#I tabulate my output for interpretability
kbl(count_minimum_nights,align = 'l',col.names = c("Minimum Nights","Number of Listings")) %>% 
kable_styling()
Minimum Nights Number of Listings
2 8500
3 6479
1 5444
4 3117
5 2346
7 997
6 702
14 198
10 170
30 136
8 52
20 49
15 35
21 33
60 23
9 22
12 21
90 21
13 20
25 16
28 14
50 14
100 8
31 7
11 6
23 6
29 6
45 5
16 4
19 4
200 4
1000 4
17 3
40 3
70 3
80 3
120 3
18 2
22 2
24 2
26 2
27 2
49 2
85 2
160 2
180 2
360 2
365 2
999 2
34 1
35 1
39 1
42 1
43 1
48 1
56 1
59 1
61 1
66 1
75 1
89 1
92 1
102 1
107 1
150 1
270 1
300 1
430 1
500 1
1100 1

I find that, in descending order of commonality, the most common values for the variable minimum_nights are 2, 3, 1, 4 and 5. A value of 2, the most common value, stands out. I assume that one of the reasons for 2 minimum nights stay being the most common is the considerable decrease in host cleaning fee (expense), as compared with cleaning fee (expenses) for a 1 night stay. This means that the majority of hosts set a 2 night minimum in order to keep cleaning fees (and thus the overall price of the stay) down.

In order to substantiate this reasoning, I put cleaning fee into relation of the minimum nights stay:

listings_necessary %>% 
  select(minimum_nights, cleaning_fee) %>% 
  group_by(minimum_nights) %>% 
  summarise(mean = mean(cleaning_fee)) %>%
  arrange(desc(mean)) %>% 
  kbl(align = 'l',col.names = c("Minimum Nights","Average Cleaning Fee")) %>% 
kable_styling()
Minimum Nights Average Cleaning Fee
24 1500.0
35 1250.0
89 1000.0
56 450.0
200 375.0
1000 317.0
300 300.0
270 298.0
430 250.0
160 249.0
28 247.9
30 243.3
999 243.0
7 232.7
70 231.7
1 226.4
5 221.7
90 221.3
50 220.0
6 213.9
4 213.7
3 209.0
11 208.2
8 205.3
34 200.0
39 200.0
100 200.0
120 200.0
2 196.3
31 195.6
14 193.7
20 190.9
15 181.1
45 180.0
21 173.4
13 172.1
25 159.3
10 159.0
12 153.0
85 150.0
150 150.0
29 141.7
60 139.9
19 137.5
9 117.2
17 116.7
80 116.7
16 100.0
22 100.0
42 100.0
40 89.7
23 53.2
1100 50.0
18 0.0
26 0.0
27 0.0
43 0.0
48 0.0
49 0.0
59 0.0
61 0.0
66 0.0
75 0.0
92 0.0
102 0.0
107 0.0
180 0.0
360 0.0
365 0.0
500 0.0

Above table substantiates my reasoning, since the average cleaning fee for listings with a minimum 1 night stay is considerably higher than that for listings with a minimum 2 night stay.

Excluding Listings with minimum_nights<=4

I exclude listings for minimum nights greater than 4, since my analysis requires us to calculate the price of a 4 night stay in Copenhagen for 2 guests.

listings_necessary <- listings_necessary %>%
  filter(minimum_nights <= 4) # I use the less than, equal to sign in order to filter my selection

Check Resulting Data Types

I now ensure that my data wrangling stages have been successful using the skim function:

#I check that my recoding, imputation and variable creation has been successful, and find that it has been
skim(listings_necessary)
(#tab:check_data_types)Data summary
Name listings_necessary
Number of rows 23540
Number of columns 52
_______________________
Column type frequency:
character 4
Date 3
factor 7
logical 7
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 23540 0
house_rules 10980 0.53 1 1000 0 11313 0
host_verifications 0 1.00 2 156 0 295 0
amenities 0 1.00 2 1179 0 22096 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 6 1.00 2008-06-27 2020-06-26 2015-07-10 3281
first_review 3707 0.84 2009-09-04 2020-06-28 2017-08-20 2481
last_review 3707 0.84 2011-08-10 2020-06-28 2019-08-05 1769

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
neighbourhood 0 1 FALSE 21 Nør: 4386, Ves: 3568, Ind: 2955, Fre: 2707
neighbourhood_cleansed 0 1 FALSE 11 Nrr: 4378, Ves: 3710, Ind: 3205, Fre: 2715
property_type 0 1 FALSE 27 Apa: 19948, Con: 1422, Hou: 912, Tow: 400
room_type 0 1 FALSE 4 Ent: 19568, Pri: 3860, Sha: 74, Hot: 38
bed_type 0 1 FALSE 5 Rea: 23218, Pul: 180, Fut: 72, Cou: 38
cancellation_policy 0 1 FALSE 4 fle: 10376, mod: 7444, str: 5439, sup: 281
prop_type_simplified 0 1 FALSE 5 Apa: 19948, Con: 1422, Hou: 912, Oth: 858

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.11 FAL: 20878, TRU: 2662
host_has_profile_pic 0 1 1.00 TRU: 23482, FAL: 58
host_identity_verified 0 1 0.37 FAL: 14835, TRU: 8705
neighbourhood_group_cleansed 23540 0 NaN :
is_location_exact 0 1 0.79 TRU: 18574, FAL: 4966
instant_bookable 0 1 0.29 FAL: 16722, TRU: 6818
is_business_travel_ready 0 1 0.00 FAL: 23540

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
host_response_rate 18798 0.20 90.22 26.11 0.00 100.00 100.00 100.00 100.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
host_acceptance_rate 9766 0.59 74.56 31.60 0.00 53.00 89.00 100.00 100.0 <U+2581><U+2581><U+2582><U+2582><U+2587>
host_listings_count 0 1.00 6.33 37.65 0.00 1.00 1.00 1.00 737.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
zipcode 672 0.97 2063.43 375.61 20.00 1824.00 2200.00 2300.00 4672.0 <U+2581><U+2583><U+2587><U+2581><U+2581>
latitude 0 1.00 55.68 0.02 55.62 55.67 55.68 55.70 55.7 <U+2581><U+2583><U+2587><U+2587><U+2581>
longitude 0 1.00 12.56 0.03 12.45 12.54 12.55 12.58 12.6 <U+2581><U+2582><U+2587><U+2585><U+2582>
accommodates 0 1.00 3.26 1.59 1.00 2.00 3.00 4.00 16.0 <U+2587><U+2582><U+2581><U+2581><U+2581>
bathrooms 0 1.00 1.06 0.27 0.00 1.00 1.00 1.00 10.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
bedrooms 0 1.00 1.51 1.07 0.00 1.00 1.00 2.00 101.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 0 1.00 1.96 1.39 0.00 1.00 2.00 2.00 25.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
square_feet 23245 0.01 709.13 554.79 0.00 140.00 753.00 1033.00 2691.0 <U+2587><U+2587><U+2583><U+2581><U+2581>
price 0 1.00 846.70 1034.46 0.00 498.00 698.00 996.00 69175.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
security_deposit 0 1.00 728.89 2425.88 0.00 0.00 0.00 0.00 35000.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
cleaning_fee 0 1.00 209.05 247.85 0.00 0.00 150.00 300.00 4000.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
guests_included 0 1.00 1.52 1.03 1.00 1.00 1.00 2.00 14.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
extra_people 0 1.00 65.49 125.42 0.00 0.00 0.00 100.00 2024.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 2.31 0.97 1.00 2.00 2.00 3.00 4.0 <U+2585><U+2587><U+2581><U+2586><U+2583>
number_of_reviews 0 1.00 14.80 28.98 0.00 2.00 6.00 16.00 637.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 3.08 7.17 0.00 0.00 0.00 4.00 376.0 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 0 1.00 95.43 6.26 20.00 94.00 97.00 100.00 100.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 0 1.00 9.78 0.61 2.00 10.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 0 1.00 9.49 0.90 2.00 9.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 0 1.00 9.85 0.51 2.00 10.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 0 1.00 9.88 0.48 2.00 10.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 0 1.00 9.67 0.63 2.00 9.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 0 1.00 9.54 0.73 2.00 9.00 10.00 10.00 10.0 <U+2581><U+2581><U+2581><U+2581><U+2587>
reviews_per_month 0 1.00 0.50 0.72 0.01 0.17 0.28 0.56 30.6 <U+2587><U+2581><U+2581><U+2581><U+2581>
accommodates_simplified 0 1.00 3.22 1.46 1.00 2.00 3.00 4.00 7.0 <U+2587><U+2582><U+2585><U+2581><U+2582>
bedrooms_per_capacity 0 1.00 0.48 0.21 0.00 0.40 0.50 0.50 16.8 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds_per_capacity 0 1.00 0.60 0.26 0.00 0.50 0.50 0.75 4.0 <U+2587><U+2582><U+2581><U+2581><U+2581>
bathrooms_per_capacity 0 1.00 0.39 0.19 0.00 0.25 0.33 0.50 7.5 <U+2587><U+2581><U+2581><U+2581><U+2581>

skim() shows us that there are 31 ‘numeric’ variables and 7 ‘factor’ variables. I summarise their names in tables in the subsequent sub-section.

Visualizations:

I create three different lists, summarising with numeric, logical and factor variables, respectively.

Numeric Variables

#I isolate and tabulate numeric variables
num <- listings_necessary %>%
  select_if(is.numeric) %>%  #selecting numeric variables
select(-c("latitude", "longitude", "zipcode")) %>% #excluding irrelevant variables
  na.omit()

list_num <- c(colnames(num)) #create a list of names of the variables

kbl(list_num, col.names="Numeric Variables") %>%  #creating a table of the list
  kable_styling()
Numeric Variables
host_response_rate
host_acceptance_rate
host_listings_count
accommodates
bathrooms
bedrooms
beds
square_feet
price
security_deposit
cleaning_fee
guests_included
extra_people
minimum_nights
number_of_reviews
number_of_reviews_ltm
review_scores_rating
review_scores_accuracy
review_scores_cleanliness
review_scores_checkin
review_scores_communication
review_scores_location
review_scores_value
reviews_per_month
accommodates_simplified
bedrooms_per_capacity
beds_per_capacity
bathrooms_per_capacity

Factor Variables

#I isolate and tabulate categorical (factor) variables
factor <- listings_necessary %>% 
  select_if(is.factor) #selecting categorical variables

list_factor <- c(colnames(factor)) #create a list of names of the variables

kbl(list_factor, col.names="Factor Variables") %>% #creating a table of the list
  kable_styling()
Factor Variables
neighbourhood
neighbourhood_cleansed
property_type
room_type
bed_type
cancellation_policy
prop_type_simplified

Logical Variables

#I isolate and tabulate logical variables
logical <- listings_necessary %>% 
  select_if(is.logical) #selecting logical variables

list_logical <- c(colnames(logical)) #create a list of names of the variables

kbl(list_logical, col.names="Logical Variables") %>% #creating a table of the list
 kable_styling()
Logical Variables
host_is_superhost
host_has_profile_pic
host_identity_verified
neighbourhood_group_cleansed
is_location_exact
instant_bookable
is_business_travel_ready

Density plots: Continuous Variables

For the continuous variables I use density plots to look at the data. I group the continuous variables into four different groups: property-related, host-related, review-related and price-related variables.

Property-Related Variables

flat_longer <-  num %>% 
  select("accommodates_simplified", "bedrooms_per_capacity", "beds_per_capacity", "bathrooms_per_capacity", "square_feet", "guests_included", "minimum_nights", "bedrooms"
) %>%
  pivot_longer(names_to= "variable_name", values_to="values", everything() )

ggplot(flat_longer, aes(x=values), na.rm=TRUE)+
  geom_density(fill="grey")+
  facet_wrap(vars(variable_name), scales="free", ncol=4)+
  labs(x="",title="Exploring Property-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuous variables", y="Density") +
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 
  theme(axis.text.y=element_blank()) + 
  NULL

The distribution of the above selected variables is not unified, and I therefore take this into acccount as I build my models. I note specifically that Bedroom per capacity has most values around 0.5 and could therefore be less valuable than the total count of bedrooms.

Host-Related Variables

host_longer <-  num %>% 
  select("host_response_rate", "host_acceptance_rate", "host_listings_count") %>%
  pivot_longer(names_to= "variable_name", values_to="values", everything() )

ggplot(host_longer, aes(x=values), na.rm=TRUE)+
  geom_density(fill="grey")+
  facet_wrap(vars(variable_name), scales="free", ncol=4)+
  labs(x="",title="Exploring Host-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuous variables", y="Density") +
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 
  theme(axis.text.y=element_blank()) + 
  NULL

While the variable host_response_rate has most values around 100% and is therefore not interesting for further analysis, I are interested in evaluating host_acceptance_rate and host_listings_count.

Review-Related Variables

review_longer <-  num %>% 
  select("number_of_reviews", "number_of_reviews_ltm", "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value", "reviews_per_month") %>%
  pivot_longer(names_to= "variable_name", values_to="values", everything())

ggplot(review_longer, aes(x=values), na.rm=TRUE)+
  geom_density(fill="grey")+
  facet_wrap(vars(variable_name), scales="free", ncol=4)+
  labs(x="",title="Exploring Review-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuous variables", y="Density") +
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 
  theme(axis.text.y=element_blank()) + 
  NULL

Although most of the variables have heavily skewed distributions and tightly clustered observations which makes drawing inference challenging, I may distinguish an interesting distribution of review_scores_rating. At this stage, I speculate that review variables may account for significant differences between listing prices, and thus be a likely determinant of the price of a stay in Copenhahgen.

Price-Related Variables

price_longer <-  num %>% 
  select("price", "security_deposit", "cleaning_fee", "extra_people") %>%
  pivot_longer(names_to= "variable_name", values_to="values", everything() )

ggplot(price_longer, aes(x=values), na.rm=TRUE)+
  geom_density(fill="grey")+
  facet_wrap(vars(variable_name), scales="free", ncol=4)+
  labs(x="",title="Exploring Price-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuoUs variables", y="Density") +
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 
  theme(axis.text.y=element_blank()) + 
  scale_x_continuous(labels=label_dollar())+
  NULL

Overall, my goal is to explain and be able to predict the prices of renting AirBnB properties in Copenhagen. The price will be calculated by summarizing the price, cleaning fee and any additional fees for extra people not included in the base rate of the listing. Consequently, it is important that I know how the price components are distributed. In particular, a price-related variable that draws attention is cleaning_fee, whose distribution indicates rather high variability. This may indeed be an influential factor in further analysis.

Tables: Logical Variables

For the logical values I create a table counting FALSE and TRUE for each logical variable:

#I tidy logical variables
logical_values_clean <- listings_necessary %>%
  select_if(is.logical) %>%  #I  check for logical values
  select(-c("neighbourhood_group_cleansed")) %>%   #I do not want to include neighbourhood_group_cleansed, since it is also a logical variable
  pivot_longer(names_to = "type", values_to ="values", cols=1:6) %>%  #I pivot the table in order to have it cleansed
  group_by(type, values) %>%
  summarise(count= n()) %>% 
  pivot_wider(names_from = "values", values_from ="count")

logical_values_clean %>% 
  mutate(type = case_when(
    type =="host_has_profile_pic" ~ "Host has profile picture",
    type =="host_identity_verified" ~ "Host identity is verified",
    type =="host_is_superhost" ~ "Host is a superhost",
    type =="is_business_travel_ready" ~ "Flat is business-travel ready",
    type =="is_location_exact" ~ "Location is exact",
    type =="instant_bookable" ~ "Flat is instantly bookable"
  )) %>% 
kbl(col.names=c("Logical Variable", "False Count","True Count" ), caption="Count of True and False for every Logical Variable") %>%
  kable_styling() # I want a nice style for my table
(#tab:logical_tables)Count of True and False for every Logical Variable
Logical Variable False Count True Count
Host has profile picture 58 23482
Host identity is verified 14835 8705
Host is a superhost 20878 2662
Flat is instantly bookable 16722 6818
Flat is business-travel ready 23540 NA
Location is exact 4966 18574

The count of true values in “Flat is business travel ready” is NA because every value is FALSE.

As for this step of the analysis I may observe that logical variables such as: host_has_profile_pic, is_business_travel_ready and host_is_superhost would probably not account for much of variability in price (the outcome variable of my analyses), since most of their corresponding values are rather homogeneous. Contrastingly, due to their lack of homogeneity, it seems that host’s identity, ability to instant book the flat and exact location of flat may account for variability in prices to a greater degree.

Barcharts: Categorical Variables

Now, I create barcharts to evaluate my categorical variables:

factor_longer <-  factor %>%  #I tidy the data into a neat format
  select(-property_type) %>% 
    pivot_longer(names_to= "variable_name", values_to="values", everything() ) %>% 
  group_by(variable_name,values) %>%   #I precount variables in order to sort them
  mutate(count_name_occurr = n() )

ggplot(factor_longer, aes(y=reorder(values,count_name_occurr) , na.rm=TRUE))+ # I create a sorted barplot with variables
  geom_bar(fill="grey", orientation ="y")+
  facet_wrap(vars(variable_name), scales="free", ncol=3)+
  labs(title="Exploring the Distribution of Categorical Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Faceted bar chart of selected categorical variables")+
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 

  NULL

I may observe that the categorical variable bed_type will probably not account for much variability in price, since most of its corresponding values are homogeneous. However, consistent with this rationale, it seems that neighborhood, room_type, prop_type_simplified and cancellation_policy may account for variability in prices to a greater extent, since they are more evenly distributed and thus less homogeneous.

Numeric Variable Correlations: Correlation Matrix

I inspect the numeric variables for correlations among them. Therefore, I create a correlation matrix and visualize it with the function corrplot(). I excluded some variables due to above mentioned reasoning, and also excluded all review-related variables, as I will look at them separately.

correlation_matrix <- num %>% 
  select(-c(number_of_reviews_ltm, bedrooms_per_capacity, beds_per_capacity, reviews_per_month, accommodates_simplified, host_response_rate, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, reviews_per_month)) %>% 
  cor() 

corrplot(correlation_matrix,
         method="color",
         type="full")

In the correlation plot above, bluer shades represent stronger positive correlations, and redder shades reflect stronger negative correlations. I observe that much of the correlation appears within specific groups of variables (e.g. listing-type related variables or review related variables), which is not surprising. Correlation between variables such as bedrooms, beds, accommodates and price was also anticipated, since more of any of these things implies (in most cases) more of the others.

Combined Variable Types: ggpairs() Visualisations

Review-Related Variables

Because all the review-related variables seem to highly correlate on the heathmap. I look at them in detail:

listings_necessary %>% 
  select(review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, reviews_per_month) %>% 
  ggpairs(aes(alpha=0.1))+
  theme_bw()

As expected, the correlations among the review-related variables are very high, but none of the values is higher than 0.67 so I do not need to worry about them.

Property-Related Variables

Now, I evaluate whether property-related variables correlate:

listings_necessary %>% 
  select(square_feet, accommodates_simplified, bedrooms, beds, bathrooms_per_capacity, prop_type_simplified) %>% 
  ggpairs(aes(alpha=0.1, color= prop_type_simplified))+
  theme_bw()

Here I can see that some listing-related variables do correlate strongly. Beds and Bedrooms, for example, exceed 0.7 for House listings and other property types. It is also highly noticeable that property types bundled as “Other” correlate more strongly with most other variables than the other clearly-defined property types. In order to avoid problematic multicollinearity, I need to take these relationships into consideration when building my models.

A Selection of Interesting Variables

Having evaluated the key correlations between both Review-Related and Listing-Related variables, I now evaluate a mix of interesting variables and see if they correlate with one another, in an effort to glean a better understanding of which variable combinations may cause problematic multicollinearity in my final model:

listings_necessary %>%
 select(number_of_reviews, review_scores_rating, bathrooms, accommodates, host_is_superhost) %>%
   GGally::ggpairs(aes(alpha = 0.4, colour=host_is_superhost),
                 ) +
  theme_bw()

Reassuringly, I detect no strong correlation between these values, which encourages us to employ them together in a model to predict the price of a 4 night stay in an Copenhagen AirBnB for 2 guests.

Mapping

Map of Copenhagen AirBnBs by Nightly Price

#I set colors for price 
price_color <- colorNumeric(palette = "Reds",
                            domain = c(1:2000),
                            reverse = FALSE)

#I initiate the leaflet map
map <-leaflet(data = listings_necessary) %>% 
  
  # Map with OpenStreetMap.Mapnik
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  
  # Map centered on the coordinate of Copenhagen with zoom of 11
  setView(lat = 55.6761, lng = 12.5683, zoom = 11) %>% 

  # Add circle markers with popups and labels
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~price_color(price), 
                   fillOpacity=0.6,
                   popup = ~listing_url,
                   label = ~property_type)

#print map
map

As would be expected, I find that AirBnBs are most densely clustered in the center of the city, and are highest priced in the most popular tourist spots, including Gammel Strand, Norreport and Kongens Nytorv. The most dense clusterings can be found around the lakes in Norrebro, particularly on the North West side, which is less green than the Southern bank. Interestingly, both the sea bank to the west and also the worse connected area of Brygge also command high Airbnb prices, especially relative to areas such as Fredriks Brygge to the North West. In fact it is clear that the further North or South from the city center in Kongens Nytorv and Nyhavn, the lower priced AirBnBs are on average. This holds to a greater extent further from the coastline - despite being relatively far from the center, Amager still commands relatively high Airbnb prices.

However, this map has a huge amount of datapoints. Though this is not problematic for identifying which areas have expensive or cheap AirBnBs, which makes it hard to identify which areas have the highest density of AirBnBs. Therefore, I cluster the data points for ease of interpretation.

Map of Copenhagen Airbnbs with Clustering

map_cluster <-leaflet(data = listings_necessary) %>% 
  
  # Map with OpenStreetMap.Mapnik
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  
  # Map centered on the coordinate of Copenhagen with zoom of 11
  setView(lat = 55.6761, lng = 12.5683, zoom = 11) %>% 

  # Add circle markers with popups and labels
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillOpacity=0.6,
                   popup = ~listing_url,
                   label = ~property_type,
                   
                   # cluster 
                   clusterOptions = markerClusterOptions())
#print map_cluster
map_cluster

This map demonstrates how central Copenhagen has over 17000 separate AirBnB listings, over 5x more than Amager, the region with the next highest number of listings. As would be expected, the area I identified as commanding the highest Airbnb prices, the popular tourist areas of Norreport, Kongens Nytorv and Nyhavn, has the highest number (5000) of listings. Further, Norrebro and Southern Osterbro, i.e. the lakes of the city, have almost 3000 Airbnb listings in a relatively small area (especially since much of it is occupied by water). As I saw in the earlier map, there are also extensive listings in Norrebro, despite commanding on average a lower price than the very center of the city.

Now I know where the most expensive AirBnBs are, and the neighborhoods in which most AirBnBs are clustered, I are interested in knowing whether there is a geographical pattern to the type of AirBnB’s available in Copenhagen. Given the high price of AirBnBs in the city center, perhaps I would expect more Townhouses and Houses, but perhaps the price of the listings is just a function of their location, and therefore the opposite may be true - all the expensive inner-city listings may only be Apartments or Condominiums. Since these are mere speculations, I now investigate potential relationships further.

Map of Copenhagen Airbnbs by Property Type

# group by prop_type_simplified
apartment <- listings_necessary %>% 
  filter(prop_type_simplified == 'Apartment')

condominium <- listings_necessary %>% 
  filter(prop_type_simplified == 'Condominium')

house <- listings_necessary %>% 
  filter(prop_type_simplified == 'House')

townhouse <- listings_necessary %>% 
  filter(prop_type_simplified == 'Townhouse')
# set colors for price of different property types
apartment_color <- colorNumeric(palette = "Reds",
                            domain = c(1:2000),
                            reverse = FALSE)

condominium_color <- colorNumeric(palette = "Blues",
                            domain = c(1:2000),
                            reverse = FALSE)

house_color <- colorNumeric(palette = "Greens",
                            domain = c(1:2000),
                            reverse = FALSE)

townhouse_color <- colorNumeric(palette = "Yellows",
                            domain = c(1:2000),
                            reverse = FALSE)
# initiate leaflet map
map_prop_type <-leaflet() %>% 
  
  # Map with OpenStreetMap.Mapnik
  addProviderTiles("OpenStreetMap.Mapnik") %>% 

  # Add circle markers with popups and labels for apartment
  addCircleMarkers(data = apartment,
                   group = "apartment",
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~apartment_color(price), 
                   fillOpacity=0.6,
                   popup = ~listing_url,
                   label = ~property_type) %>% 

  # Add circle markers with popups and labels for condominium
  addCircleMarkers(data = condominium,
                   group = "condominium",
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~condominium_color(price), 
                   fillOpacity=0.6,
                   popup = ~listing_url,
                   label = ~property_type) %>% 

   # Add circle markers with popups and labels for house
  addCircleMarkers(data = house,
                   group = "house",
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~house_color(price), 
                   fillOpacity=0.6,
                   popup = ~listing_url,
                   label = ~property_type) %>% 
    
  # Add circle markers with popups and labels for townhouse
  addCircleMarkers(data = townhouse,
                   group = "townhouse",
                   lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~townhouse_color(price), 
                   fillOpacity=0.6,
                   popup = ~listing_url,
                   label = ~property_type) %>% 
  
  # set layers by property type
  addLayersControl(overlayGroups = c("apartment", "condominium","house","townhouse")) %>% 


  # Map centered on the coordinate of Copenhagen with zoom of 11
  setView(lat = 55.6761, lng = 12.5683, zoom = 11) 

# print map_prop_type
map_prop_type

This map allows us to identify some patterns in the distribution of certain types of AirBnB listings across Copenhagen. If I reduce the map down (using the layers in the top right hand corner) to just houses, I find that most AirBnB listings for houses are in Bronshoj, Uttersley and Southern Amager, close to Kastrup. Generally, the same is also true for Townhouses, which are scattered largely in Amager, Fredriksberg and Bronshoj. However, Townhouses also seem to be clustered (into small clusters of 5-10 listings) on the coastline in the North East of the city, and also on the Southern side of the lakes, towards their Eastern limit on the way out of the city center and towards Osterbro, which is far more residential. As I anaklyse Condominiums, I find the price pattern I observed earlier becomes clearer - the more central the AirBnB, the higher the price. Though there are AirBnB condominiums across the whole city, these do seem largely clustered around the lakes, Christianshavn, and Vesterbro, with the most expensive listings being in Kongens Nytorv and Fredriksberg. This mapping also allows us to see how Apartments are by far the most common listing of all in Copenhagen, with more Apartment listings than all other types combined, and that in general the price of AirBnB listings which are apartments falls sharply on leaving the main waterways of the city center.

Most evidently, by separating out listing types in this way, I see how geographical heterogeneity in price differs between listing types. Specifically, I find that while Apartment and Condominium prices decline as the distance to the city center increases, House and Townhouse prices are not overtly correlated with centrality in the same way - some of the most expensive House and Townhouse listings being on the Eastern coastline, the Northern lakes, and way out of the city in Valby, Vanlose and to the North of Fredriksberg.

Geographical Conclusions

From these maps, I can conclude that:

While these are interesting conclusions, the presence of expensive listings of all property types away from the city center, and in less desirable, traditionally residential areas which require use of trains and other public transport to access city sights demonstrates that location (or rather, proximity to city center) cannot be the only predictor of AirBnB listing price. Further, I have only analysed the base listing price so far, and not the actual cost to the AirBnB guest, which includes additional guest fees and cleaning fees.

Therefore, I run a number of regression models on test and training data to glean a greater insight into which factors, besides geography, are associated with the cost of an Airbnb stay in Copenhagen. In order to achieve this, I need to evaluate not only the base list price of Copenhagen AirBnBs, but use a forecast price as my outcome variable. Therefore, I simulate a popular scenario: a 4-night visit to Copenhagen for 2. Using this assumed “trip”, I standardise the prices of all listings to the price that would be paid by 2 visitors staying at the listing, under this scenario.

Regression Analysis

Variable Specification

Creating the ‘price_4_nights’ variable

In order to run my regression analyses and better understand factors associated with listing prices in Copenhagen, I create a new variable, ‘price_4_nights’, which calculates the total cost for two people to stay at each Airbnb property for 4 nights. I apply the following formula to define the variable:

i) if the number of guests included with the base nightly rate is greater than or equal to 2 (i.e. 2 people would only need to pay the base nightly rate), then I calculate the rate for 4 nights by multiplying the base rate by 4 and adding the cleaning fee.

ii) if the number of guests included with the base nightly rate is less than 2 (i.e., it is 1), then I calculate the nightly rate by adding the cost for an additional guest (the second guest) to the base rate, multiplying this sum by 4 and then adding the cleaning fee.

#I create the price_4_nights variable using the case_when() function

listings_necessary <- 
  listings_necessary %>% 
  mutate(
    #my conditional statement says that if the number of guests included in the price is greater than or equal to 2, the price is just the cleaning fee plus 4* the nightly rate, since the 2 people will stay for 4 nights
    price_4_nights = case_when(
      #however, if the number of guests included in the price is less than 2, meaning that a party of 2 would have to pay for exactly 1 extra guest, then the price will be 4* the nightly price, which is the base price plus the price for that extra guest, plus the cleaning fee
      guests_included >= 2 ~ cleaning_fee + 4*price,
      TRUE ~ 4 * (price + extra_people) +cleaning_fee
    )
  )

#when I run skim() to evaluate my new variable, I also find that the minimum value is 0, which will cause errors when log transformed
skim(listings_necessary$price_4_nights)
(#tab:defining_the_target_variable)Data summary
Name listings_necessary$price_…
Number of rows 23540
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 3657 4194 0 2204 3092 4193 276700 <U+2587><U+2581><U+2581><U+2581><U+2581>
#therefore, I examine how many observations have a valuje of 0 for the price_4_nights variable
summary((listings_necessary$price_4_nights)==0)
##    Mode   FALSE    TRUE 
## logical   23539       1
#I find 1 observation has a value of 0 for this variable, and subsequently filter it out
listings_necessary <- listings_necessary %>% 
  filter(price_4_nights != 0)

#I ensure the erroneous price_4_nights 0 value has been removed from the working dataset, and find that it has been
summary((listings_necessary$price_4_nights)==0)
##    Mode   FALSE 
## logical   23539

Preliminary Outcome Variable Analyses

To glean a preliminary understanding of how location is related to price_4_nights, I plot a map of the price_4_nights variable

# set colors for price 
price_color <- colorNumeric(palette = "Reds",
                            domain = c(1:10000),
                            reverse = FALSE)

# initiate leaflet map
stay_price_map <-leaflet(data = listings_necessary) %>% 
  
  # Map with OpenStreetMap.Mapnik
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  
  # Map centered on the coordinate of Copenhagen with zoom of 11
  setView(lat = 55.6761, lng = 12.5683, zoom = 11) %>% 

  # Add circle markers with popups and labels
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~price_color(price_4_nights), 
                   popup = ~listing_url,
                   label = ~property_type)
#print map
stay_price_map

As I would expect, I find that the highest priced AirBnB stays are located in the very center of the city. However, interestingly, the 4 nights for 2 people variable seems to show that even in areas of Osterbro, which are relatively close to the city center, prices are not as high (relative to the other listings) as when I evaluate the distribution of nightly rates. This implies that listings which seemed relatively expensive slightly further from the city center on a nightly basis are actually relatively cheaper on a 4-nightly basis for 2 people, perhaps because they have lower cleaning fees (which I did not account for in the earlier mappings) or include more guests without the need for additional fees.

Though this visualisation is informative, it does not allow us to gain a deep understanding of the nature of my outcome variable, its distribution, or whether it should be transformed in order to generate valid and accurate regressions. Therefore, I analyse the distributions of my outcome variable, and the log of my outcome variable, to evaluate its nature and whether transformation is necessary.

Now, I analyse the distributions of both the ‘price_4_nights’ and log(price_4_nights) variables

#I calculate and report basic summary statistics for the average price for 2 people for 4 nights for Airbnb listings in Copenhagen
summary_price4nights <- summarise(listings_necessary, mean=mean(price_4_nights), median=median(price_4_nights), min=min(price_4_nights), max=max(price_4_nights))
summary_price4nights

#I summarise the distribution of price_4_nights without taking it's log
options(scipen=1000)
no_log_dist <- ggplot(listings_necessary) + geom_histogram(aes(price_4_nights), bins=80) +  scale_x_continuous(labels = scales::dollar, breaks=seq(0, 20000, by = 1000), limits=c(0,20000)) + labs(x="Price: 2 Guests for 4 Nights", y="Count",title="The Distribution of Copenhagen Airbnb Stay Prices is Heavily Positively Skewed",subtitle="Distribution of prices for a 4-night Airbnb stay for a party of 2 in Copenhagen") + theme_fivethirtyeight() + theme(axis.title = element_text()) + geom_vline(xintercept=summary_price4nights$mean, size=0.5, color="red") + geom_vline(xintercept=summary_price4nights$median, size=0.5, color="blue") + annotate(geom = 'text', label = "Sample Mean", x = summary_price4nights$mean+70, y = 800, angle=270, colour="white") + annotate(geom = 'text', label = "Sample Median", x = summary_price4nights$median+70, y = 700, angle=270, colour="white")

#I now re-summarise the distribution of price_4_nights, having taken its log

#I calculate and report basic summary statistics for the average price for 2 people for 4 nights for Airbnb listings in Copenhagen
summary_log_price4nights <- summarise(listings_necessary, mean=mean(log(price_4_nights)), median=median(log(price_4_nights)), min=min(log(price_4_nights)), max=max(log(price_4_nights)))
summary_log_price4nights
#I summarise the distribution of log_price_4_nights without taking it's log
options(scipen=1000)
log_dist <- ggplot(listings_necessary) + geom_histogram(aes(log(price_4_nights)), bins=80) +  scale_x_continuous(labels = scales::dollar, breaks=seq(0,13, by = 1), limits=c(0,13)) + labs(x="Price: 2 Guests for 4 Nights", y="Count",title="The Logged Distribution of Copenhagen Airbnb Stay Prices is Quasi-Normal",subtitle="Distribution of log prices for a 4-night Airbnb stay for a party of 2 in Copenhagen") + theme_fivethirtyeight() + theme(axis.title = element_text()) + annotate(geom = 'text', label = "Sample Median", x = summary_log_price4nights$median+0.15, y = 1700, angle=270, colour="white") +  geom_vline(xintercept=summary_log_price4nights$median, size=0.5, color="blue")

#I plot the non-logged distribution alongside the logged distribution for ease of comparison
no_log_dist + log_dist

Note that in the un-logged distribution of price_4_nights above, I limited the histogram to a maximum price of $20,000 (when the actual maximum is almost $280,000). I added this limit because I believe that the substantial skew of the distribution is more evident when the range of the data is limited in the x-axis, and because the presence of extreme outliers is sufficiently demonstrated by values at $20,000, meaning that plotting extreme outliers at $280,000 is not of analytical importance. Contrastingly, the logged histogram is quasi-normally distributed, with no extreme outliers.

Subsequently, I choose to use the log(price_4_nights) variable in all my analyses

These plots demonstrate two key issues with the un-logged price data: firstly, it is positively skewed, with a broad spread of observations at the upper end of the distribution; and secondly, there are a number of very extreme outliers at the upper end of the distribution, reaching up to $280,000, with only those up to $20,000 shown in the distribution above.

Though normal distribution is not itself an assumption of linear regression (it is only necessary that Y, price_4_nights is normally distributed at each value of X, i.e. there is no heteroskedasticity), a normally distributed Y is favourable, since it is more likely to meet this assumption across many models. Therefore, in finding that log(price_4_nights) is distributed log normally, I favour this transformed variable in my subsequent regressions.

Splitting the Data into Test and Training Data

Since adjusted R squared can be misleading and lead to overfitting, I evaluate out-of-sample goodness of fit. This requires us to randomly partition the data into a training dataset containing 75% of the data, which is used to fit the model, and 25% testing data, used to assess model performance.

Throughout my model selection process, I build and interpret my models using all the data, before using my training data and testing data to evaluate overfitting via the Root Mean Squared Error (RMSE) and Adjusted R-Squared (R2).

set.seed(1234)
# create train_test_split, 75% for training 25% for testing
train_test_split <- initial_split(listings_necessary, prop = 0.75) 
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)

Fitting Regressions, Evaluating Diagnostics and Testing for Collinearity

For ease of interpretation, I use the term ‘stay’ to describe a 4-night stay in a Copenhagen Airbnb for 2 people. Therefore, the variable price_4_nights is the price for my ‘stay’.

Further, I apply a conventional 0.05 alpha level throughout, meaning that my coefficients are deemed statistically significant if their p-value is less than 0.05. When I say that a coefficient is statistically significant at all conventional significance levels, it is effectively 0, since conventional significance levels refers to alpha levels of 0.05, 0.01, 0.001 and 0.0001. Since the p-value is lower than all of these levels, it is deemed robustly statistically significant.

Model 1

Next, I fit a regression model (model1) which takes log(price_4_nights) as its outcome variable, and 3 explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating

#I run a standard linear regression model with a selection of base variables
model1 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating,
             data = listings_train)

#I report the output of the linear regression model, with a 95% confidence interval, as a neat table
model1 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 1708.02 1.06 127.20 0.000 1522.93 1915.60
prop_type_simplifiedCondominium 1.08 1.02 4.56 0.000 1.04 1.11
prop_type_simplifiedHouse 1.29 1.02 12.62 0.000 1.24 1.34
prop_type_simplifiedOthers 1.60 1.02 22.80 0.000 1.53 1.66
prop_type_simplifiedTownhouse 1.42 1.03 11.75 0.000 1.34 1.50
number_of_reviews 1.00 1.00 1.15 0.251 1.00 1.00
review_scores_rating 1.01 1.00 9.56 0.000 1.00 1.01
#I evaluate initial model performance on the full dataset, specifically in terms of adjusted R2, and print this in a neat table
model1 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R Squared","Adjusted R Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R Squared Adjusted R Squared Sigma t-statistic p-value Degrees of Freedom
0.047 0.047 0.507 146 0 6

A Note on Interpreting my Anti-Logged Coefficients

Since my Y (outcome variable) is logged, I take the exponent, exp(), of the estimate to aid interpretibility. Specifically, this exp() coefficient tells us the ratio of the expected geometric mean price_4_nights for the coefficient group, versus the base group (for factor variables), or the ratio of the expected geometric mean for price_4_nights for each additional point increase in the explanatory variable (for numeric variables), ceteris paribus.

prop_type_simplified

Therefore, I can interpret these coefficients as percentage changes: a coefficient of 1.07 for Condominium implying that, relative to the base property type, Apartment, Condominium AirBnB listings are, on average, associated with a 7% higher price for my ‘stay’, ceteris paribus. In Model 1, I find that all levels of prop_type_simplified are statistically significant, at all conventional significance levels. Type “Others”, which includes listings which are not Apartments, Condominiums, Houses or Townhouses, is associated most strongly with ‘stay’ price: a ‘stay’ at an “Other” type of property is associated with a price 60% higher, on average, than an Apartment ‘stay’. Similarly, ceteris paribus, Townhouse ‘stays’ are associated with a 42% higher price, and House ‘stays’ are associated with a 29% higher price, than an Apartment Stay on average.

review_scores_rating

I find that each additional point rating in review score is associated with a 0.5% increase in the price of my ‘stay’, ceteris paribus. This estimate is also statistically significant at all conventional significance levels, and thus that, on average, stays at listings with higher ratings are more expensive than those at lower rated listings. Though this seems like a low coefficient, inevitably a 10-point difference in rating is associated with a relatively substantial 5% price increase - and thus the effect is cumulative and is likely to be substantive for most Airbnb guests.

Model 1 Performance

On evaluating the performance of my first model, I find that its adjusted r squared of 4.7% is extremely low. I use the adjusted R squared because this is far less likelto to be upwardly biased by the addition of new variables: it only increases if the new predictor enhances my model beyond what would be obtained by probability.

Explaining less than 5% of variation in my price_4_nights outcome variable, it is evident that these predictors are insufficient to allow effective prediction of the price of my stay. Therefore, I add another variable, room_type, to see if I can increase the performance of my model.

Model 1 Diagnostics

#I evaluate whether the 4 core assumptions of linear regression (Linear, Independence, Normal, Equality) hold
autoplot(model1, alpha = 0.3, label.size = 3) +
  theme_bw()

I find that the residuals are patterned, indicating that there is a pattern (or multiple patterns) in my data which is currently not accounted for by the model itself. Further, I find that the residuals follow a quasi-normal distribution, which indicates my linear regression normality assumption is tenable. In terms of scale-location, the equal (or constant) variance assumption appears to hold, since there is no positive or negative trends across the fitted values. Finally, I identify quite a few observations for Apartment listings with high absolute residuals, indicating that these observations may have undue influence on estimates of model parameters.

#I use the Variance Inflation Factor (VIF) to evaluate if multicollinearity is problematic for my model
kbl(vif(model1), align="l")%>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1 4 1
number_of_reviews 1 1 1
review_scores_rating 1 1 1

Since all VIF shown are less than 5, I can safely conclude that this model does not have collinearity problems, and no variables require removal.

#I run the model on training data and construct both a predictions and RMSE output
rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(model1, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#I run the model on test data and construct both a predictions and RMSE output
rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(model1, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

#I pull the R2 value from the training model
r2_train <-
  summary(model1)$r.squared

#I pull the R2 value from the testing model
r2_test <-
  cor(predict(model1, listings_test), listings_test$price_4_nights)

#I summarise the 4 values I have created or extracted in a neat table for evaluation
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-squared"))),
  ) %>% 
  kable_styling()
RMSE R-squared
train 64597 0.047
test 51236 0.094

Strangely, the RMSE (Root Mean Squared Error) and R2 is considerably higher for the training dataset than the testing dataset, with the R2 of 9.6% for test data being over twice as high as that for training data. Regardless of the fact that the model appears to fit the out-of-sample testing data better than the training data, this considerable difference across RMSE and R2 suggests that the model may suffer from overfitting.

However, preliminary research suggests that the lower test error indicates that the model generalises well, but that the training set is considerably larger than the test set, as is the case in my analyses.

Model 2

As I are also interested in determining if room_type is a significant predictor of the cost of the 4 nights (Y), ceteris paribus, I fit another regression model which includes the same explanatory variables as model1, plus the room_type variable

#see Model1 for detailed annotation of regression model creation and diagnostic stages
model2 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type,
             data = listings_necessary)

model2 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 2308.795 1.05 169.69 0.000 2111.231 2524.847
prop_type_simplifiedCondominium 1.036 1.01 2.82 0.005 1.011 1.062
prop_type_simplifiedHouse 1.316 1.02 17.81 0.000 1.277 1.357
prop_type_simplifiedOthers 1.566 1.02 27.65 0.000 1.517 1.617
prop_type_simplifiedTownhouse 1.431 1.02 15.57 0.000 1.368 1.497
number_of_reviews 1.001 1.00 10.71 0.000 1.001 1.001
review_scores_rating 1.004 1.00 7.69 0.000 1.003 1.005
room_typeHotel room 1.309 1.08 3.56 0.000 1.129 1.518
room_typePrivate room 0.539 1.01 -76.22 0.000 0.531 0.548
room_typeShared room 0.410 1.05 -16.78 0.000 0.369 0.455
model2 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R-Squared Adjusted R-Squared Sigma t-statistic p-value Degrees of Freedom
0.242 0.241 0.455 833 0 9

room_type

Applying the same interpretation methodology as for model1, I find that (ceteris paribus) relative to listings for an entire home or apartment, stays in Private Rooms or Shared Rooms are associated on average with a 46%, or 59% lower stay price, respectively. Contrastingly, stays in hotel rooms are associated, on average, with a 31% higher stay price relative to listings for an entire home or apartment. All three of these coefficients are statistically significant at all conventional significance levels.

Model 2 Performance

I find that the adjusted R-squared of my second model is considerably higher than that of my first (4.6%), at 24%. This means that the addition of room_type has substantially increased the predictive power of my model, since it now explains a quarter of variance in the price of my stay. In order to try to improve the predictive power of my model, I add explanatory variables, including variables for the number of bathrooms, bedrooms, and beds, and house size.

Model 2 Diagnostics

autoplot(model2, alpha = 0.3, label.size = 3) +
  theme_bw()

Much like in Model 1, in Model 2 I find that the residuals are patterned, indicating that there is a pattern (or multiple patterns) in my data which remains unaccounted for by the model itself. Again, I also find that though the residuals follow approximately a normal distribution. In terms of scale-location, the equal (or constant) variance assumption seems to hold less for Model 2 than Model 1, since at the upper and lower limits of fitted values there seems to be a negative, followed by a more extreme positive kink in the standardised residuals. Further, I identify quite a few observations with high absolute residuals, again indicating that these observations may have undue influence on estimates of model parameters.

kbl(vif(model2), align="l")%>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.05 4 1.01
number_of_reviews 1.02 1 1.01
review_scores_rating 1.01 1 1.00
room_type 1.07 3 1.01

Again, in Model 2, all VIF shown are less than 5, and thus I can safely conclude that this model does not have collinearity problems, and hence no variables require removal.

rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(model2, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(model2, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

r2_train <-
  summary(model2)$r.squared

r2_test <-
  cor(predict(model2, listings_test), listings_test$price_4_nights)

kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-squared"))),
  ) %>% 
  kable_styling()
RMSE R-squared
train 54463 0.242
test 44868 0.135

On comparison of the RMSE and R2 of my second model, I find that, again, the error associated with my training data is greater than that of the model on test data, which may be an artifact of small testing sample and large training sample. Despite this, the substantial difference in RMSE and also the fact R2 for the model on training data (24.2%) is almost double that of the R2 on test data (13.5%) suggests that the model suffers from overfitting.

Model 3

Now I explore whether bathrooms, bedrooms, beds and house size (accommodates) are significant predictors of the price of my stay (price_4_nights)

#see Model1 for detailed annotation of regression model creation and diagnostic stages
model3 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               bathrooms +
               bedrooms +
               beds +
               accommodates, data = listings_necessary)

model3 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 1174.003 1.04 167.787 0.000 1080.960 1275.054
prop_type_simplifiedCondominium 1.026 1.01 2.262 0.024 1.003 1.048
prop_type_simplifiedHouse 0.921 1.01 -5.584 0.000 0.895 0.948
prop_type_simplifiedOthers 1.359 1.01 21.055 0.000 1.321 1.399
prop_type_simplifiedTownhouse 1.018 1.02 0.832 0.406 0.976 1.061
number_of_reviews 1.001 1.00 9.514 0.000 1.001 1.001
review_scores_rating 1.004 1.00 10.200 0.000 1.004 1.005
room_typeHotel room 1.596 1.07 6.930 0.000 1.398 1.822
room_typePrivate room 0.666 1.01 -51.947 0.000 0.656 0.677
room_typeShared room 0.527 1.05 -13.474 0.000 0.481 0.579
bathrooms 1.165 1.01 13.907 0.000 1.140 1.191
bedrooms 1.046 1.00 13.750 0.000 1.039 1.053
beds 1.008 1.00 2.690 0.007 1.002 1.014
accommodates 1.117 1.00 38.574 0.000 1.110 1.123
model3 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R-Squared Adjusted R-Squared Sigma t-statistic p-value Degrees of Freedom
0.398 0.398 0.406 1199 0 13

I find that each of the explanatory variables added (bathrooms, bedrooms, beds and accommodates) have statistically significant coefficients at all conventional significance levels. This means that each is a statistically significant predictor of my outcome variable, the price of a 4 night stay for 2 guests in Copenhagen.

Here, I offer a brief analysis of the coefficients on each variable.

bathrooms

The coefficient on bathrooms demonstrates that ceteris paribus, each additional bathroom in an Airbnb is associated with, on average, a 16.5% higher stay price.

bedrooms

The coefficient on bedrooms demonstrates that ceteris paribus, each additional bedroom in an Airbnb is associated with, on average, a 4.6% higher stay price.

beds

The coefficient on beds demonstrates that ceteris paribus, each additional bedroom in an Airbnb is associated with, on average, a 0.8% higher stay price.

accommodates

The coefficient on bedrooms demonstrates that ceteris paribus, each additional bedroom in an Airbnb is associated with, on average, a 11.7% higher stay price.

Model 3 Performance

Again, I find that the addition of these variables has increased the adjusted R-squared of my model. My model now explains approximately 40% of variation in the price of my stay, which despite being suboptimal for accurate prediction, is now considerably better than could be obtained by chance or by the mean and a substantial improvement from Model 2, which explained just 24% of variance. It is evident that so far, by adding explanatory variables, I have improved my model considerably. Therefore, I continue to adjust my model iteratively to investigate whether other features of Copenhagen AirBnB listings are statistically significant predictors of the price of my stay.

Model 3 Diagnostics

autoplot(model3, alpha = 0.3, label.size = 3) +
  theme_bw()

I find that the residuals are patterned, since they are clustered at the lower end of the fitted values, indicating that there is a pattern in my data which Model 3 still fails to account for. Further, I find that the residuals do not generally follow a quasi-normal distribution, which indicates my linear regression normality assumption may be untenable. In terms of scale-location, the equal (or constant) variance assumption appears to hold, since the positive trend across the fitted values is not observed when I take out instance 17340, as in the residuals vs fitted plot. Finally, I again identify quite a few observations with high absolute residuals, indicating that these observations may have undue influence on estimates of model parameters.

kbl(vif(model3), align="l")%>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.28 4 1.03
number_of_reviews 1.02 1 1.01
review_scores_rating 1.01 1 1.00
room_type 1.26 3 1.04
bathrooms 1.26 1 1.12
bedrooms 1.74 1 1.32
beds 2.41 1 1.55
accommodates 2.97 1 1.72

As in Models 1 and 2, all VIF shown are less than 5, and thus I can safely conclude that this model does not have collinearity problems despite the VIF on beds and accommodates being unusually high (for obvious reasons) relative to the other explanatory variables. Thus, no variables require removal.

rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(model3, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(model3, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

r2_train <-
  summary(model3)$r.squared

r2_test <-
  cor(predict(model3, listings_test), listings_test$price_4_nights)

kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-Squared"))),
  ) %>% 
  kable_styling()
RMSE R-Squared
train 39777 0.398
test 38709 0.231

On comparison of the RMSE and adjusted R-squared of my third model, I find that, again, the error associated with my training data is greater than that of the model on test data. Much like for Model 2, I do however also find that the R2 for the model on training data (39.8%) is considerably higher than that of the R2 on test data (23.1%), which suggests that my model still suffers from overfitting.

Model 4

I now evaluate whether Superhosts command a pricing premium, as I continue to control for all previous explanatory variables

#see Model1 for detailed annotation of regression model creation and diagnostic stages
model4 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               bathrooms +
               bedrooms +
               beds +
               accommodates +
               host_is_superhost, data = listings_necessary)


model4 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 1221.273 1.04 167.920 0.000 1124.040 1326.917
prop_type_simplifiedCondominium 1.024 1.01 2.129 0.033 1.002 1.047
prop_type_simplifiedHouse 0.922 1.01 -5.578 0.000 0.895 0.948
prop_type_simplifiedOthers 1.358 1.01 21.005 0.000 1.320 1.397
prop_type_simplifiedTownhouse 1.015 1.02 0.723 0.470 0.974 1.058
number_of_reviews 1.001 1.00 6.489 0.000 1.000 1.001
review_scores_rating 1.004 1.00 9.101 0.000 1.003 1.005
room_typeHotel room 1.603 1.07 7.005 0.000 1.405 1.829
room_typePrivate room 0.664 1.01 -52.433 0.000 0.654 0.674
room_typeShared room 0.528 1.05 -13.463 0.000 0.481 0.580
bathrooms 1.165 1.01 13.905 0.000 1.140 1.190
bedrooms 1.046 1.00 13.805 0.000 1.039 1.053
beds 1.009 1.00 2.872 0.004 1.003 1.014
accommodates 1.116 1.00 38.387 0.000 1.110 1.122
host_is_superhostTRUE 1.077 1.01 8.374 0.000 1.058 1.096
model4 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R-Squared Adjusted R-Squared Sigma t-statistic p-value Degrees of Freedom
0.4 0.4 0.405 1121 0 14

Superhost

My fourth regression model demonstrates clearly that whether the Airbnb host is a superhost is a statistically significant predictor of the price of my stay at all conventional significance levels. Specifically, I find that ceteris paribus, on average, if the host is a superhost (versus not being a superhost), the price of my stay is 7.7% higher.

Model 4 Performance

Though the newly included superhost predictor is statistically significant, the performance of this model has only improved marginally from an adjusted R-squared of 39.8% to approximately 40.0%, and thus this explanatory variable has not substantively improved the proportion of variance in price_4_nights explained by my model. Therefore, I continue to explore further predictors which may improve the predictive power of my model.

Model 4 Diagnostics

autoplot(model4, alpha = 0.3, label.size = 3) +
  theme_bw()

I find that the residuals are patterned, since they cluster in the lower range of fitted values, indicating that there is a pattern in my data which is still not accounted for by the model itself. Again, the residuals follow approximately a normal distribution. In terms of scale-location, the equal (or constant) variance assumption seems to hold, since the is positive trend across the fitted values is predominantly associated to one outlier. Finally, as in all previous models (Models 1-3), I identify quite a few observations with high standardized residuals, indicating that these observations may have undue influence on estimates of model parameters.

kbl(vif(model4), align="l")%>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.28 4 1.03
number_of_reviews 1.13 1 1.06
review_scores_rating 1.03 1 1.01
room_type 1.27 3 1.04
bathrooms 1.26 1 1.12
bedrooms 1.74 1 1.32
beds 2.41 1 1.55
accommodates 2.97 1 1.72
host_is_superhost 1.13 1 1.06

As in Models 1, 2 and 3, all VIF shown are less than 5, and thus I can safely conclude that this model does not have collinearity problems despite the VIF on beds and accommodates being unusually high, as I observed in Model 3 diagnostics. Thus, no variables require removal at this stage.

rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(model4, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(model4, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

r2_train <-
  summary(model4)$r.squared

r2_test <-
  cor(predict(model4, listings_test), listings_test$price_4_nights)

kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-Squared"))),
  ) %>% 
  kable_styling()
RMSE R-Squared
train 39651 0.400
test 38589 0.231

On comparison of the RMSE and R2 of my fourth model, I find that, again, the error associated with my training data is greater than that of the model on test data. Much like for Models 2 and 3, I do however also find that the R2 for the model on training data (40.0%) is considerably higher than that of the R2 on test data (23.1%), which suggests that my model still suffers from overfitting.

Model 5

Subsequently, I investigate whether the listing’s location (exact or inexact) is a significant predictor of the price of the stay

Logically, I may expect that if an exact location is provided, guests may be willing to pay a higher price for their stay simply for the reduced uncertainty (or security) this information provides them ahead of their visit.

#see Model1 for detailed annotation of regression model creation and diagnostic stages
model5 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               bathrooms +
               bedrooms +
               beds +
               accommodates +
               host_is_superhost +
               is_location_exact, data = listings_necessary)

# exact location is not a significant predictor, so I exclude it in my next step regression

model5 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 1227.822 1.04 167.235 0.000 1129.612 1334.569
prop_type_simplifiedCondominium 1.024 1.01 2.099 0.036 1.002 1.046
prop_type_simplifiedHouse 0.922 1.01 -5.582 0.000 0.895 0.948
prop_type_simplifiedOthers 1.355 1.01 20.711 0.000 1.316 1.394
prop_type_simplifiedTownhouse 1.015 1.02 0.682 0.495 0.973 1.058
number_of_reviews 1.001 1.00 6.512 0.000 1.000 1.001
review_scores_rating 1.004 1.00 9.123 0.000 1.003 1.005
room_typeHotel room 1.607 1.07 7.042 0.000 1.408 1.834
room_typePrivate room 0.664 1.01 -52.438 0.000 0.654 0.674
room_typeShared room 0.528 1.05 -13.455 0.000 0.482 0.580
bathrooms 1.165 1.01 13.909 0.000 1.140 1.190
bedrooms 1.046 1.00 13.800 0.000 1.039 1.053
beds 1.009 1.00 2.900 0.004 1.003 1.014
accommodates 1.116 1.00 38.400 0.000 1.110 1.122
host_is_superhostTRUE 1.078 1.01 8.417 0.000 1.059 1.096
is_location_exactTRUE 0.992 1.01 -1.279 0.201 0.979 1.004
model5 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R-Squared Adjusted R-Squared Sigma t-statistic p-value Degrees of Freedom
0.4 0.4 0.405 1047 0 15

exact location

I can see clearly that the p-value (0.20) for exact location is not less than 0.05, my alpha level, and thus this is not a statistically significant predictor of the price of my stay. Counter to my expectations, this may be the case since Copenhagen has strict property rental, purchase and letting legislation, which is regularly and heavily enforced. As such, the proability of a “false” listing, and thus of any risk or uncertainty associated with location, is low - and thus hosts may see no price premium in providing the exact location relative to cities with less regulation on property services like Airbnb.

Model 5 Performance

Re-evaluating my model’s performance, I find that my adjusted R squared is effectively unchanged (at 40.5%, from 40.0% in Model 4), which is to be expected since exact location is not a statistically significant predictor of my outcome variable, price_4_nights.

Having included explanatory variables for key characteristics of Airbnb listings which I would expect to be closely related to the price of my Airbnb stay, I now turn to evaluating the relationship between price and location more precisely.

Since exact location is not a statistically significant predictor of the price of my stay and does not substantially increase the adjusted R squared of my model, I do not include it in subsequent regressions.

Model 5 Diagnostics

autoplot(model5, alpha = 0.3, label.size = 3) +
  theme_bw()

Again, I find that the residuals are patterned (though seemingly less so than in Model 4), since they cluster in the lower range of fitted values but not as extremely. This indicates that there are still underlying patterns in my data which are not accounted for by the model itself. Further, I observe the residuals follow approximately a normal distribution. In terms of scale-location, the equal (or constant) variance assumption also seems to hold, just like in Models 3 and 4, since there is a no clear positive trend across the fitted values if I take out the outliers. Finally, as in all previous models (Models 1-4), I identify quite a few observations with high absolute residuals, indicating that these observations may have undue influence on estimates of model parameters.

kbl(vif(model5), align="l")%>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.30 4 1.03
number_of_reviews 1.13 1 1.06
review_scores_rating 1.03 1 1.01
room_type 1.27 3 1.04
bathrooms 1.26 1 1.12
bedrooms 1.74 1 1.32
beds 2.41 1 1.55
accommodates 2.97 1 1.72
host_is_superhost 1.13 1 1.06
is_location_exact 1.02 1 1.01

As before, all VIF shown are less than 5, and thus I can safely conclude that this model does not have collinearity problems. I make no removals due to collinearity, but rather only due to a lack of significance of the is_location_exact predictor.

rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(model5, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(model5, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

r2_train <-
  summary(model5)$r.squared

r2_test <-
  cor(predict(model5, listings_test), listings_test$price_4_nights)

kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-Squared"))),
  ) %>% 
  kable_styling()
RMSE R-Squared
train 39643 0.400
test 38611 0.231

Whereas Models 1 through 4 demonstrated that the model had a higher RMSE on training data than test data, I now find what would be expected in a traditional case of overfitting: that the model has a lower RMSE on training data and a considerably higher R2, exactly as in Model 4 (40.0% versus 23.1%). Again, and unconventionally, I find the error associated with my training data is greater than that of the model on test data. However, the difference between the RMSE is, as in Model 4, relatively small (the RMSE on the training data is less than 3% higher than that on the test data).

Model 6: Regressions for Neighbourhood Clusters

Though my map showing the relative price of stays at Airbnb’s provided a good impression of the geographical distribution of Copenhagen Airbnb stays by price and property type, I now delve deeper into how the price of Airbnb stays varies across distinct neighbourhoods within the city.

First, I specify 5 levels in a variable called “neighbourhood_simplified”, which groups together nearby and demographically neighbourhoods

As an example, I have grouped together Fredriksberg and Vanlose not only because they are adjacent, but because they are both large city districts predominantly characterised by residential property and large parkland, away from the tourist-dominated city centre.

Broadly, I divide city neighborhoods into the categories: North, North West, West, South, and East. Though some of these categories are immediately (geographically) adjacent to one another, where possible I have ensured that the constituents share similar socio-demographic characteristics, and thus there is an undeniably distinct character to each of the 5 neighbourhoods I have specified.

# I group all neighbourhoods into North, East, West, South, Northwest using case_when, for the whole dataset

listings_necessary <- listings_necessary %>% 
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c('Frederiksberg', 'Vanlse') ~ 'East',
    neighbourhood_cleansed %in% c('Vesterbro-Kongens Enghave', 'Indre By ') ~ 'West',
    neighbourhood_cleansed %in% c('Amager', 'Amager Vest') ~ 'South',
    neighbourhood_cleansed %in% c('Bispebjerg', 'Brnshj-Husum') ~ 'Northwest',  
    TRUE ~ 'North'
  ))

# I group all neighbourhoods into North, East, West, South, Northwest using case_when, for the training dataset

listings_train <- listings_train %>% 
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c('Frederiksberg', 'Vanlse') ~ 'East',
    neighbourhood_cleansed %in% c('Vesterbro-Kongens Enghave', 'Indre By ') ~ 'West',
    neighbourhood_cleansed %in% c('Amager', 'Amager Vest') ~ 'South',
    neighbourhood_cleansed %in% c('Bispebjerg', 'Brnshj-Husum') ~ 'Northwest',  
    TRUE ~ 'North'
  ))

# I group all neighbourhoods into North, East, West, South, Northwest using case_when, for the testing dataset

listings_test <- listings_test %>% 
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c('Frederiksberg', 'Vanlse') ~ 'East',
    neighbourhood_cleansed %in% c('Vesterbro-Kongens Enghave', 'Indre By ') ~ 'West',
    neighbourhood_cleansed %in% c('Amager', 'Amager Vest') ~ 'South',
    neighbourhood_cleansed %in% c('Bispebjerg', 'Brnshj-Husum') ~ 'Northwest',  
    TRUE ~ 'North'
  ))

Next, I use this new simplified neighbourhood variable in a new regression to determine whether it is a statistically significant predictor of the price of my stay and can improve the explained variation of my model

#see Model1 for detailed annotation of regression model creation and diagnostic stages

#I run a regression with all model4 variables, as well as the new neighbourhood_simplified explanatory variable

model6 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               bathrooms +
               bedrooms +
               beds +
               accommodates +
               host_is_superhost +
               neighbourhood_simplified,
               data = listings_necessary)

#I find that grouped neighbourhood is a significant predictor

model6 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 1194.863 1.04 168.687 0.000 1100.426 1297.404
prop_type_simplifiedCondominium 1.028 1.01 2.502 0.012 1.006 1.050
prop_type_simplifiedHouse 0.970 1.01 -2.091 0.037 0.943 0.998
prop_type_simplifiedOthers 1.362 1.01 21.662 0.000 1.325 1.401
prop_type_simplifiedTownhouse 1.016 1.02 0.773 0.440 0.976 1.059
number_of_reviews 1.000 1.00 5.005 0.000 1.000 1.001
review_scores_rating 1.004 1.00 8.826 0.000 1.003 1.005
room_typeHotel room 1.587 1.07 6.996 0.000 1.394 1.806
room_typePrivate room 0.670 1.01 -52.315 0.000 0.660 0.680
room_typeShared room 0.546 1.05 -13.022 0.000 0.498 0.598
bathrooms 1.163 1.01 14.051 0.000 1.139 1.188
bedrooms 1.044 1.00 13.645 0.000 1.038 1.051
beds 1.009 1.00 2.963 0.003 1.003 1.014
accommodates 1.117 1.00 39.519 0.000 1.111 1.123
host_is_superhostTRUE 1.073 1.01 8.106 0.000 1.055 1.091
neighbourhood_simplifiedNorth 1.071 1.01 8.939 0.000 1.055 1.088
neighbourhood_simplifiedNorthwest 0.781 1.01 -20.380 0.000 0.763 0.800
neighbourhood_simplifiedSouth 1.046 1.01 4.069 0.000 1.024 1.069
neighbourhood_simplifiedWest 1.097 1.01 9.710 0.000 1.076 1.117
model6 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R-Squared Adjusted R-Squared Sigma t-statistic p-value Degrees of Freedom
0.424 0.424 0.397 963 0 18

neighbourhood_simplified

This regression demonstrates how, across all levels and at all conventional significance levels, neighbourhood_simplified is a statistically significant predictor of the price of my (4 night) stay for 2 guests in Copenhaghen. Notably, relative to Airbnb listings in the East (Fredriksberg and Vanlose), those in the North West are associated with lower priced stays, while those in the South, North and West are associated with higher priced stays, each to an increasing degree. While, ceteris paribus, the average Airbnb listed in the North West are on average 22% cheaper than those listed in the East, those listed in the South, North and West are on average 5%, 7% and 10% more expensive, respectively.

This is to be expected, since the North West region I specified consists of Bispebjerg and Bronshoj-Husum, both of which are extensive suburbs consisting largely of single family detached homes. Each require a 20 minute drive or 30 minute train to get into the city, which for most visitors is enough of a reason either to seek Airbnbs more centrally in the city, or pay considerably less for their stay in these areas. The West is also an obvious and expected outlier. Since this contains the majority of the most central parts of the city via the neighbourhood Indre By, spanning Nyhavn harbour, Tivoli gardens and most of the cities sights, the average price is notably (and justifiably) higher with no need to commute and bars and restaurants on the ground floor of most properties.

Model 6 Performance

Despite the addition of statistically significant geographical predictors, I find that the adjusted R-squared of my model has not increased greatly since my Model 4 iteration, which had an adjusted R-squared of 40.0%. At just 42.9%, my model still explains less than half of the variation in the price_4_nights variable (the price of my Copenhagen stay). In an effort to do better, I consider whether listings which provide generous cancellation policies (and therefore reduce uncertainty surrounding guests’ visit to the city) are associated with higher priced stays.

Model 6 Diagnostics

#see Model1 for detailed annotation of regression model creation and diagnostic stages
autoplot(model6, alpha = 0.3, label.size = 3) +
  theme_bw()

Again, I find that the residuals are less patterned, compared with previous models, they cluster in the lower range of fitted values and the trend line has a negative gradient deviating from Y=0. This indicates that despite my best efforts, there are remaining underlying patterns in my data which are not accounted for by the model itself, ratifying my underwhelming 40.0% R-squared. Further, the residuals roughly follow a normal distribution.

In terms of scale-location, the equal (or constant) variance assumption also seems to hold, just like in Models 3-5. Finally, though in all previous models I identified quite a few observations with high absolute residuals, there seems to be heavier clustering towards 0, indicating that while some observations may have undue influence on estimates of model parameters, this may be less problematic than in earlier iterations.

#see Model1 for detailed annotation of regression model creation and diagnostic stages
kbl(vif(model6), align="l")%>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.31 4 1.03
number_of_reviews 1.13 1 1.06
review_scores_rating 1.03 1 1.01
room_type 1.27 3 1.04
bathrooms 1.26 1 1.12
bedrooms 1.74 1 1.32
beds 2.41 1 1.55
accommodates 2.97 1 1.72
host_is_superhost 1.13 1 1.06
neighbourhood_simplified 1.04 4 1.01

As before, all VIF shown are less than 5, and thus this model does not appear to have collinearity problems.

#see Model1 for detailed annotation of regression model creation and diagnostic stages

rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(model6, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(model6, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

r2_train <-
  summary(model6)$r.squared

r2_test <-
  cor(predict(model6, listings_test), listings_test$price_4_nights)

kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-Squared"))),
  ) %>% 
  kable_styling()
RMSE R-Squared
train 38809 0.424
test 37847 0.237

This comparison demonstrates a similar trend to Models 1-4, where the training data had a higher RMSE than on the test data, however relative to both Models 4 and 5, Model 6 now shows a larger difference between the RMSEs on the two datasets, indicating that overfitting (albeit of a potentially unconventional form, as discussed in Model 1 Diagnostics) is potentially worse in this model. Focussing on the R2, I find that there is a small increase in the difference between test and training data fit relative to Model 5, as the training R2 is now 43% rather than 40%. Consequently, overfitting appears to be a persistent problem across all of my models.

Model 7

Now, I think about the relationship between cancellation policies and the price of my stay

#see Model1 for detailed annotation of regression model creation and diagnostic stages

model7 <-
  
  lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               bathrooms +
               bedrooms +
               beds +
               accommodates +
               host_is_superhost +
               neighbourhood_simplified +
               cancellation_policy, data = listings_necessary) 

# cancellation policy is a significant predictor 
model7 %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 1144.188 1.04 168.802 0.000 1054.346 1241.686
prop_type_simplifiedCondominium 1.023 1.01 2.071 0.038 1.001 1.045
prop_type_simplifiedHouse 0.971 1.01 -2.071 0.038 0.944 0.998
prop_type_simplifiedOthers 1.250 1.02 12.827 0.000 1.208 1.293
prop_type_simplifiedTownhouse 1.010 1.02 0.468 0.640 0.970 1.051
number_of_reviews 1.000 1.00 3.000 0.003 1.000 1.000
review_scores_rating 1.004 1.00 9.164 0.000 1.003 1.005
room_typeHotel room 1.795 1.07 8.823 0.000 1.576 2.044
room_typePrivate room 0.679 1.01 -50.595 0.000 0.669 0.690
room_typeShared room 0.554 1.05 -12.815 0.000 0.506 0.606
bathrooms 1.161 1.01 13.974 0.000 1.137 1.186
bedrooms 1.043 1.00 13.427 0.000 1.037 1.050
beds 1.011 1.00 3.739 0.000 1.005 1.016
accommodates 1.114 1.00 38.730 0.000 1.108 1.120
host_is_superhostTRUE 1.067 1.01 7.544 0.000 1.049 1.085
neighbourhood_simplifiedNorth 1.067 1.01 8.485 0.000 1.051 1.083
neighbourhood_simplifiedNorthwest 0.784 1.01 -20.235 0.000 0.766 0.803
neighbourhood_simplifiedSouth 1.043 1.01 3.852 0.000 1.021 1.066
neighbourhood_simplifiedWest 1.092 1.01 9.316 0.000 1.072 1.112
cancellation_policymoderate 1.054 1.01 8.639 0.000 1.041 1.067
cancellation_policystrict_14_with_grace_period 1.130 1.01 18.206 0.000 1.116 1.146
cancellation_policysuper_strict_60 1.343 1.03 10.055 0.000 1.268 1.422
model7 %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R-Squared Adjusted R-Squared Sigma t-statistic p-value Degrees of Freedom
0.434 0.434 0.393 859 0 21

Cancellation Policy

The regression above shows an interesting and unexpected relationship. Though I may hypothesise that listings with more flexible cancellation policies may be more expensive, since guests may be willing to pay a premium for the ability to cancel whenever they like, this is not the case.

The regression output demonstrates that cancellation policy is a statistically significant predictor of the price of my stay across all levels and at all conventional significance levels. Relative to a flexible cancellation policy, the average price of my Airbnb stay, ceteris paribus, increases as the cancellation policy becomes stricter. This is demonstrated by the fact that moderate cancellation policy is associated on average with a 5.6% higher priced stay than an Airbnb with a flexible policy, increasing to 13.4% and 31.1% for strict and super strict cancellation policies, respectively. Thus, it is evident that more expensive stays are associated with stricter cancellation policies, which makes sense since these properties are probably harder to find guests for and thus need to secure their bookings with certainty, versus cheaper properties in very high demand which can be re-booked within hours of a cancellation.

Model 7 Performance

Again, the addition of the statistically significant cancellation_policy explanatory variable has only improved my adjusted R squared by 1%, from 42.9% to 43.9%. Despite the fact that I have gained a considerably better understanding of the main drivers of the price of my Copenhagen Airbnb stay, I are still explaining less than half of the variation in the price of listings for a 4 night stay with 2 guests.

This suggests that a more systematic approach is required to build a model which explains as much variation in the price of my stay as possible, which is what I now turn to exploring.

Model 7 Diagnostics

#see Model1 for detailed annotation of regression model creation and diagnostic stages
autoplot(model7, alpha = 0.3, label.size = 3) +
  theme_bw()

In Model 7, my penultimate model, I find a very similar outcome to my previous diagnostics plots, in that the residuals are less patterned, indicating my most detailed model has captured more key underlying patterns in my data than previous models. Further, the majority of residuals follow the normal distribution as shown in the Normal Q-Q plot

Again, in terms of scale-location, the equal (or constant) variance assumption also seems to hold, just like previous model. Finally, a number of observations with high absolute residuals, indicate that some observations do have undue influence on estimates of model parameters.

Though these diagnostics suggest that the core assumptions of linear regression may not hold perfectly, these deviations are not substantial in most cases. However, I revise my variable selection in an attempt to ensure that as many of these linear regression assumptions as possible do hold, ensuring the validity of outputted coefficients.

#see Model1 for detailed annotation of regression model creation and diagnostic stages

kbl(vif(model7), align="l")%>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.00 4 1.09
number_of_reviews 1.16 1 1.08
review_scores_rating 1.03 1 1.01
room_type 1.32 3 1.05
bathrooms 1.26 1 1.12
bedrooms 1.74 1 1.32
beds 2.41 1 1.55
accommodates 2.99 1 1.73
host_is_superhost 1.14 1 1.07
neighbourhood_simplified 1.05 4 1.01
cancellation_policy 1.62 3 1.08

As in all previous models, I ratify earlier findings that there is no concern over collinearity between my variables, since all explanatory variables have VIFs less than 5.

#see Model1 for detailed annotation of regression model creation and diagnostic stages

rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(model7, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(model7, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

r2_train <-
  summary(model7)$r.squared

r2_test <-
  cor(predict(model7, listings_test), listings_test$price_4_nights)

kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-squared"))),
  ) %>% 
  kable_styling()
RMSE R-squared
train 38328 0.434
test 37760 0.240

In my penultimate model, I find the greatest difference in terms of R2, where the model explains 44% of variance in my outcome variable (price_4_nights) on training data, and just 23% on my test data. As in Models 1-6, I find that, again, the error associated with my training data is greater than that of the model on test data. Yet, as in Models 3, 4 and 5, this difference is relatively small, implying that overfitting may not be as severe as R2 alone suggests. However, this warrants further investigation into the nature and potential implications of overfitting in this context, which is beyond the scope of the present analyses.

Regression Summary Tables

In order to summarise my 7 models so far and their performance, I publish a summary table of the predictors I include in each model, their statistical signficance and logged coefficients, their adjusted R-squared and residual standard error.

#I generate a neatly named summary table showing a comparison of all 7 models

huxreg(model1,model2,model3,model4,model5,model6,model7, 
       coefs=c("Property Type | Condominium"="prop_type_simplifiedCondominium",
               "Property Type | House"="prop_type_simplifiedHouse","Property Type | Others"="prop_type_simplifiedOthers","Property Type | Townhouse"="prop_type_simplifiedTownhouse",
               "Number of Reviews"="number_of_reviews", 
               "Review Scores Rating"="review_scores_rating",
               "Room Type | Hotel Room"="room_typeHotel room", 
               "Room Type | Private Room"="room_typePrivate room", 
               "Room Type | Shared Room"="room_typeShared room", "Bathrooms"="bathrooms"))

(1)(2)(3)(4)(5)(6)(7)
Property Type | Condominium0.074 ***0.035 ** 0.025 *  0.024 *  0.023 *  0.027 *  0.022 *  
(0.016)   (0.013)   (0.011)   (0.011)   (0.011)   (0.011)   (0.011)   
Property Type | House0.251 ***0.275 ***-0.082 ***-0.082 ***-0.082 ***-0.030 *  -0.030 *  
(0.020)   (0.015)   (0.015)   (0.015)   (0.015)   (0.014)   (0.014)   
Property Type | Others0.468 ***0.449 ***0.307 ***0.306 ***0.304 ***0.309 ***0.223 ***
(0.021)   (0.016)   (0.015)   (0.015)   (0.015)   (0.014)   (0.017)   
Property Type | Townhouse0.349 ***0.358 ***0.018    0.015    0.014    0.016    0.010    
(0.030)   (0.023)   (0.021)   (0.021)   (0.021)   (0.021)   (0.021)   
Number of Reviews0.000    0.001 ***0.001 ***0.001 ***0.001 ***0.000 ***0.000 ** 
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
Review Scores Rating0.006 ***0.004 ***0.004 ***0.004 ***0.004 ***0.004 ***0.004 ***
(0.001)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
Room Type | Hotel Room        0.269 ***0.467 ***0.472 ***0.475 ***0.462 ***0.585 ***
        (0.076)   (0.067)   (0.067)   (0.067)   (0.066)   (0.066)   
Room Type | Private Room        -0.617 ***-0.406 ***-0.410 ***-0.410 ***-0.401 ***-0.387 ***
        (0.008)   (0.008)   (0.008)   (0.008)   (0.008)   (0.008)   
Room Type | Shared Room        -0.892 ***-0.640 ***-0.638 ***-0.638 ***-0.605 ***-0.591 ***
        (0.053)   (0.047)   (0.047)   (0.047)   (0.046)   (0.046)   
Bathrooms                0.153 ***0.153 ***0.153 ***0.151 ***0.149 ***
                (0.011)   (0.011)   (0.011)   (0.011)   (0.011)   
N17655        23539        23539        23539        23539        23539        23539        
R20.047    0.242    0.398    0.400    0.400    0.424    0.434    
logLik-13053.633    -14877.111    -12150.268    -12115.234    -12114.415    -11634.332    -11431.993    
AIC26123.267    29776.222    24330.536    24262.468    24262.830    23308.663    22909.985    
*** p < 0.001; ** p < 0.01; * p < 0.05.
Though I have already performed a relatively extensive comparison of models 1-7, there are still many variables I haven’t explored. In the next section, I included a greater number of explanatory variables and constructed new variables to form the best model for predictions.

Final Model Analysis

Building the Best Model

Based on model7, I want to add more variables to increase predictive power, specifying a model which has the highest possible Adjusted R2, and therefore which explains as much of the variance in the outcome variable, price of a 4-night stay for 2 at an AirBnB in Copenhagen, as possible.

In the amenities variable, I believe WIFI, TV and Internet should matter most for people in today’s highly digitized society. Further, since most of the properties are Apartments and Houses, instead of Hotels, travelers might also expect to have a kitchen (and be willing to pay for it). Therefore I built 4 logical variables indicating whether the property has these amenities:

listings_necessary <- listings_necessary %>% 
  mutate(has_wifi = grepl('Wifi', amenities),
         has_TV = grepl('TV', amenities),
         has_internet = grepl('Internet', amenities),
         has_kitchen = grepl('Kitchen', amenities))

listings_train <- listings_train %>% 
  mutate(has_wifi = grepl('Wifi', amenities),
         has_TV = grepl('TV', amenities),
         has_internet = grepl('Internet', amenities),
         has_kitchen = grepl('Kitchen', amenities))

listings_test <- listings_test %>% 
  mutate(has_wifi = grepl('Wifi', amenities),
         has_TV = grepl('TV', amenities),
         has_internet = grepl('Internet', amenities),
         has_kitchen = grepl('Kitchen', amenities))

Also, the absolute values of bathroom, bedroom and beds are highly correlated with rooms/beds per capacity (as shown in the Listing-Related coefficient matrix from my EDA), and thus in order to avoid problematic multicollinearity, in order to achieve a higher R2 and ensure model simplicity and interpretibility, I only included the absolute values. Further, my previous maps demonstrated that properties are clustered in neighbourhood by their price, and thus I include my simplified neighbourhood variable, too.

Inspired by previous models, I also added reviews of other matters of the property except review_scores_checkin and review_scores_accuracy, since I do not consider them significant and they were not statistically significant in my preliminary tests. Finally, I added more variables to account for the flexibility of the AirBnB booking process, such as instant_bookable.

Further, since I used price, extra_people, guests_included and cleaning_fee to calculate price_4_nights, it makes no sense to include them in my model. this is obvious: once I know these variables, I can calculate price_4_nights straightaway, which does not allow us to understand the underlying drivers of listing prices in the city.

My Optimised Variable Categories

The variables I used in my optimal model can be categorised as follows:

  1. Facility and Location
  2. Amenities
  3. Super-Host
  4. Property Reviews
  5. Flexibility of Property Booking and Terms
#I therefore build my optimised linear regression model as follows, incorporating variables representing each of these categories:

best_model <- lm(log(price_4_nights) ~
                   
            # Facility and Location     
             prop_type_simplified + 
             room_type + 
             bathrooms + 
             bedrooms + 
             beds + 
             accommodates +  
             neighbourhood_simplified +
              
           # Amenities
             has_wifi +
             has_TV +
             has_internet +
             has_kitchen +

            # Super-Host
             host_is_superhost + 
              
            # Property Reviews
             review_scores_rating +
             review_scores_communication +
             review_scores_location +
             review_scores_value +
             review_scores_cleanliness +
               
            # Flexibility of Property Booking and Terms
             instant_bookable +
             security_deposit +
             cancellation_policy +
             minimum_nights,
            
           data = listings_necessary)

I now summarise the output of my model as before:

# Format and display model summary 
best_model %>% 
  tidy(conf.int = TRUE) %>% 
  mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>% 
    kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
  kable_styling()
Term Estimate Standard Error t-statistic p-value Lower CI Upper CI
(Intercept) 873.446 1.06 111.868 0.000 775.715 983.490
prop_type_simplifiedCondominium 1.019 1.01 1.776 0.076 0.998 1.040
prop_type_simplifiedHouse 0.974 1.01 -1.878 0.060 0.947 1.001
prop_type_simplifiedOthers 1.226 1.02 11.974 0.000 1.186 1.267
prop_type_simplifiedTownhouse 1.010 1.02 0.471 0.638 0.970 1.050
room_typeHotel room 1.580 1.07 7.030 0.000 1.391 1.795
room_typePrivate room 0.690 1.01 -47.425 0.000 0.680 0.701
room_typeShared room 0.560 1.05 -12.874 0.000 0.513 0.612
bathrooms 1.152 1.01 13.549 0.000 1.128 1.175
bedrooms 1.043 1.00 13.527 0.000 1.036 1.049
beds 1.011 1.00 3.811 0.000 1.005 1.016
accommodates 1.111 1.00 38.709 0.000 1.105 1.117
neighbourhood_simplifiedNorth 1.068 1.01 8.842 0.000 1.053 1.084
neighbourhood_simplifiedNorthwest 0.827 1.01 -16.070 0.000 0.808 0.846
neighbourhood_simplifiedSouth 1.042 1.01 3.857 0.000 1.021 1.064
neighbourhood_simplifiedWest 1.091 1.01 9.502 0.000 1.072 1.111
has_wifiTRUE 1.049 1.01 3.495 0.000 1.021 1.077
has_TVTRUE 1.099 1.01 16.337 0.000 1.087 1.111
has_internetTRUE 0.978 1.01 -3.824 0.000 0.967 0.989
has_kitchenTRUE 0.892 1.01 -7.670 0.000 0.866 0.918
host_is_superhostTRUE 1.054 1.01 6.565 0.000 1.038 1.071
review_scores_rating 1.003 1.00 4.102 0.000 1.001 1.004
review_scores_communication 0.969 1.01 -4.957 0.000 0.958 0.981
review_scores_location 1.125 1.00 25.791 0.000 1.115 1.135
review_scores_value 0.920 1.00 -16.957 0.000 0.911 0.929
review_scores_cleanliness 1.042 1.00 10.640 0.000 1.034 1.050
instant_bookableTRUE 1.015 1.01 2.641 0.008 1.004 1.027
security_deposit 1.000 1.00 8.996 0.000 1.000 1.000
cancellation_policymoderate 1.058 1.01 9.553 0.000 1.046 1.071
cancellation_policystrict_14_with_grace_period 1.123 1.01 17.494 0.000 1.108 1.137
cancellation_policysuper_strict_60 1.168 1.03 5.172 0.000 1.101 1.238
minimum_nights 0.986 1.00 -5.321 0.000 0.981 0.991
# Format and display initial model performance analysis
best_model %>% 
  glance() %>% 
  select(1:6) %>% 
    kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
  kable_styling()
Model R-Squared Adjusted R-Squared Sigma t-statistic p-value Degrees of Freedom
0.465 0.464 0.383 658 0 31

I can see that except some factors in property type, all other variables are statistically significant. My model reaches 46.9% adjusted R2, which is a 3% improvement from model7.

Though my model still only explains 46.9% of variance in the price of the Copenhagen stay, it is also 46.9% better than just taking the mean of all properties as prediction, which is a considerable improvement over many alternative, particularly indiscriminate or random, approaches.

Best Model Comparison

I now report a comparison of my first model, Model 1, alongside my final model before optimisation, Model 7, and then finally my optimised model, created according to the rationale and interrogative investigation throughout this report.

huxreg(model1, model7,best_model,
              coefs=c("Property Type | Condominium"="prop_type_simplifiedCondominium",
               "Property Type | House"="prop_type_simplifiedHouse",
               "Property Type | Others"="prop_type_simplifiedOthers",
               "Property Type | Townhouse"="prop_type_simplifiedTownhouse",
               "Number of Reviews"="number_of_reviews", 
               "Review Scores Rating"="review_scores_rating",
               "Room Type | Hotel Room"="room_typeHotel room", 
               "Room Type | Private Room"="room_typePrivate room", 
               "Room Type | Shared Room"="room_typeShared room", 
               "Bathrooms"="bathrooms",
               "Bedrooms"="bedrooms",
               "Beds"="beds",
               "Accommodates"="accommodates",
               "Host is Super Host | True"="host_is_superhostTRUE",
               "Neighbourhood | North"="neighbourhood_simplifiedNorth",
               "Neighbourhood | South"="neighbourhood_simplifiedSouth",
               "Neighbourhood | North West"="neighbourhood_simplifiedNorthwest",
               "Neighbourhood | West"="neighbourhood_simplifiedWest",
               "Cancellation Policy | Moderate"="cancellation_policymoderate",
               "Cancellation Policy | Strict"="cancellation_policystrict_14_with_grace_period",
               "Cancellation Policy | Super Strict"="cancellation_policysuper_strict_60",
               "Has Wifi | True"="has_wifiTRUE",
               "Has TV | True"="has_TVTRUE",
               "Has Internet | True"="has_internetTRUE",
               "Has Kitchen | True"="has_kitchenTRUE",
               "Review Scores Communication"="review_scores_communication",
               "Review Scores Location"="review_scores_location",
               "Review Scores Value"="review_scores_value",
               "Review Scores Cleanliness"="review_scores_cleanliness",
               "Instant Bookable | True"="instant_bookableTRUE",
               "Security Deposit"="security_deposit",
               "Minimum Nights"="minimum_nights"
               ))
(1)(2)(3)
Property Type | Condominium0.074 ***0.022 *  0.019    
(0.016)   (0.011)   (0.011)   
Property Type | House0.251 ***-0.030 *  -0.026    
(0.020)   (0.014)   (0.014)   
Property Type | Others0.468 ***0.223 ***0.204 ***
(0.021)   (0.017)   (0.017)   
Property Type | Townhouse0.349 ***0.010    0.009    
(0.030)   (0.021)   (0.020)   
Number of Reviews0.000    0.000 **         
(0.000)   (0.000)           
Review Scores Rating0.006 ***0.004 ***0.003 ***
(0.001)   (0.000)   (0.001)   
Room Type | Hotel Room        0.585 ***0.457 ***
        (0.066)   (0.065)   
Room Type | Private Room        -0.387 ***-0.371 ***
        (0.008)   (0.008)   
Room Type | Shared Room        -0.591 ***-0.580 ***
        (0.046)   (0.045)   
Bathrooms        0.149 ***0.141 ***
        (0.011)   (0.010)   
Bedrooms        0.042 ***0.042 ***
        (0.003)   (0.003)   
Beds        0.011 ***0.011 ***
        (0.003)   (0.003)   
Accommodates        0.108 ***0.105 ***
        (0.003)   (0.003)   
Host is Super Host | True        0.065 ***0.053 ***
        (0.009)   (0.008)   
Neighbourhood | North        0.065 ***0.066 ***
        (0.008)   (0.007)   
Neighbourhood | South        0.042 ***0.041 ***
        (0.011)   (0.011)   
Neighbourhood | North West        -0.243 ***-0.190 ***
        (0.012)   (0.012)   
Neighbourhood | West        0.088 ***0.087 ***
        (0.009)   (0.009)   
Cancellation Policy | Moderate        0.053 ***0.057 ***
        (0.006)   (0.006)   
Cancellation Policy | Strict        0.123 ***0.116 ***
        (0.007)   (0.007)   
Cancellation Policy | Super Strict        0.295 ***0.155 ***
        (0.029)   (0.030)   
Has Wifi | True                0.048 ***
                (0.014)   
Has TV | True                0.094 ***
                (0.006)   
Has Internet | True                -0.022 ***
                (0.006)   
Has Kitchen | True                -0.114 ***
                (0.015)   
Review Scores Communication                -0.031 ***
                (0.006)   
Review Scores Location                0.118 ***
                (0.005)   
Review Scores Value                -0.083 ***
                (0.005)   
Review Scores Cleanliness                0.041 ***
                (0.004)   
Instant Bookable | True                0.015 ** 
                (0.006)   
Security Deposit                0.000 ***
                (0.000)   
Minimum Nights                -0.014 ***
                (0.003)   
N17655        23539        23539        
R20.047    0.434    0.465    
logLik-13053.633    -11431.993    -10777.786    
AIC26123.267    22909.985    21621.572    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Best Model Diagnostics

#I evaluate whether the 4 core assumptions of linear regression (Linear, Independence, Normal, Equality) hold
autoplot(best_model, alpha = 0.3, label.size = 3) +
  theme_bw()

As shown in graph above, I observe less pattern in the residuals vs fitted plot compared with previous models, indicating that the addition of variables increased the amount of variation in price_4_nights captured by my model. Similar to previous models, I can conclude that my residuals follow approximately a normal distribution, as the normal Q-Q plot suggests. Again, in terms of scale-location, the equal (or constant) variance assumption also seems to hold, as there is no clear positive nor negative trends across the fitted values. Finally, a number of observations with high absolute residuals indicate that some observations do have undue influence on estimates of model parameters, though this require further investigation which is beyond the remits of the present analysis.

#I use the Variance Inflation Factor (VIF) to evaluate if multicollinearity is problematic for my model
kbl(vif(best_model), align="l") %>%
kable_styling(full_width = F)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.04 4 1.09
room_type 1.49 3 1.07
bathrooms 1.27 1 1.13
bedrooms 1.74 1 1.32
beds 2.42 1 1.56
accommodates 3.01 1 1.74
neighbourhood_simplified 1.10 4 1.01
has_wifi 1.05 1 1.03
has_TV 1.14 1 1.07
has_internet 1.07 1 1.03
has_kitchen 1.14 1 1.07
host_is_superhost 1.05 1 1.02
review_scores_rating 2.55 1 1.60
review_scores_communication 1.45 1 1.21
review_scores_location 1.35 1 1.16
review_scores_value 2.04 1 1.43
review_scores_cleanliness 1.94 1 1.39
instant_bookable 1.09 1 1.04
security_deposit 1.13 1 1.06
cancellation_policy 1.83 3 1.11
minimum_nights 1.10 1 1.05

With all VIFs < 5, it is safe to conclude that my best model does not suffer from collineraity.

rmse_train<- 
  listings_train %>% 
  mutate(predictions = exp(predict(best_model, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

rmse_test <-
  listings_test %>% 
  mutate(predictions = exp(predict(best_model, .))) %>% 
  summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>% 
  pull()

r2_train <-
  summary(best_model)$r.squared

r2_test <-
  cor(predict(best_model, listings_test), listings_test$price_4_nights)

kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
           nrow = 2,
           dimnames = list(c("train","test"),
                           c("RMSE","R-squared"))),
  ) %>% 
  kable_styling()
RMSE R-squared
train 36521 0.465
test 36745 0.246

In comparing the RMSE and R2 of my final Best Model when applied to Test and Training data, I find that there is a very small difference in RMSE, and as would be expected, the RMSE is lower for the training than the test dataset. Not only is the difference in RMSE the lowest of all the models I have evaluated throughout this analysis, but it is a difference of only 0.60% in the context of the error figures for each model. This indicates that the model is not as overfitted to the training data as the substantially different R2 values would suggest. Indeed, despite the obvious presence of overfitting, I can safely conclude that my Best Model does perform the best of all models surveyed in terms of model fit (as well as in R2), which is reassuring.

Best Model Prediction

Now, I predict the price to live in an apartment with a private room, which has at least 10 reviews, and an average rating of at least 90, for 2 guests over the course of a 4 night stay in Copenhagen

# I select: an apartment with a private room, which has at least 10 reviews, and an average rating of at least 90,  to conduct my prediction
subset_of_listings <- listings_necessary %>% 
  filter(property_type == 'Apartment',
         room_type == 'Private room',
         review_scores_rating >= 90,
         number_of_reviews >= 10) %>% 
  
  # I take the exponential for prediction because I used log(y) in my model
  mutate(prediction = exp(predict(best_model, .))) 
# I construct a confidence interval
upper_ci <- quantile(subset_of_listings$prediction, 0.975)
lower_ci <- quantile(subset_of_listings$prediction, 0.025)

# I calculate the mean
mean_pred <- mean(subset_of_listings$prediction)

# I visualise
ggplot(subset_of_listings, aes(x = prediction)) +
  geom_histogram() +
  geom_vline(xintercept = mean_pred, size = 0.5, color = 'red') + 
  geom_vline(xintercept = upper_ci, size = 0.5, color = 'blue') +
  geom_vline(xintercept = lower_ci, size = 0.5, color = 'blue') +
  annotate(geom = 'text', label = paste("Point Prediction:", round(mean_pred)), x = mean_pred + 50, y = 35, colour="white", angle=270) +
  annotate(geom = 'text', label = paste("Upper 95% CI:", round(upper_ci)), x = upper_ci + 50, y = 35, colour="black", angle=270) +
  annotate(geom = 'text', label = paste("Lower 95% CI:", round(lower_ci)), x = lower_ci+ 50, y = 35, colour="black", angle=270) +
  theme_fivethirtyeight() +
  labs(y = "count", x = "Price: 2 Guests for 4 Nights", 
       title = 'Point Estimate and Confidence Interval for the Price of a Stay in a Copenhagen AirBnB',
       subtitle = 'Histogram of estimated price for a 4-Night Stay for 2 Guests in an AirBnB Private Room in an Apartment, rated at least 90 with at least 10 reviews') +
  scale_x_continuous(labels = label_dollar())

As the histogram shows, I conclude that my point estimate in an apartment with private room, overall rating of at least 90 and with 10 or more reviews is $1949 USD, with a 95% confidence interval of $1309 to $2597 USD.